Drawing conclusions from data is difficult because we almost never have complete information about any variable of interest. For instance, we may have only been able to send a satisfaction survey to a small subset of our customers.
The population of a study consists of all possible observations of interest.
A sample in a study consists of all the observations we actually have.
We are almost always interested in a population, but only have information about a sample. Statistical inference is the process of reasoning from the one to the other.
A parameter is a number that describes a population.
A statistic is a number that describes a sample.
The mean of the eye difference variable in our optical dataset is -0.4939496. The sample mean is found by adding together all the values of a variable in the sample data, and dividing this total by \(n\), the sample size.
If we extract 100 samples with three observations each from our data and visually show their means we can see every time we got a different value, sometimes they will be very close to the target parameter, but sometimes they can be quite far.
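A minimal sketch of this simulation in R, using a stand-in population generated to match the mean (-0.49) and SD (1.84) of the eye difference variable (an assumption, since only a preview of the real dataset is shown here):

```r
set.seed(1)
# Stand-in population with the same mean and SD as eye_difference (assumption)
population <- rnorm(10000, mean = -0.49, sd = 1.84)

# Draw 100 samples of 3 observations each and keep their means
sample_means <- replicate(100, mean(sample(population, 3)))

# The individual means scatter widely around the target parameter
summary(sample_means)
hist(sample_means, main = "Means of 100 samples of size 3")
```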
Appt Date | Patient Name | DOB | Age | IsPrivate | IsSubsidised | IsSmoker | IsDriver | TakingMedication | Left Eye Value | Right Eye Value | Optician First Name | Optician Last Name | eye_difference
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2019-08-20T22:16:53Z | Carl Walker | 1960-04-19T00:00:00Z | 60 | 1 | 0 | 0 | 0 | yes | 3.843656 | 4.991663 | Elizabeth | Montgomery | -1.148007
2019-07-24T13:44:16Z | Aaron Wheeler | 1958-02-15T00:00:00Z | 62 | 1 | 1 | 1 | 1 | yes | 4.359093 | 1.861221 | Shirley | Evans | 2.497872
2019-01-22T23:39:12Z | Martin Franklin | 1993-12-17T00:00:00Z | 26 | 0 | 0 | 1 | 1 | no | 4.522499 | 1.495162 | Kimberly | Cook | 3.027337
2019-11-06T05:03:04Z | Thomas Parker | 2000-03-11T00:00:00Z | 20 | 1 | 1 | 0 | 0 | no | 1.653130 | 5.302994 | Kenneth | Griffin | -3.649863
2019-07-20T15:25:28Z | Joe Ward | 1970-03-06T00:00:00Z | 50 | 1 | 1 | 0 | 1 | yes | 1.148688 | 5.438031 | Kimberly | Cook | -4.289344
2019-06-26T12:36:19Z | Kenneth Bradley | 1941-02-26T00:00:00Z | 79 | 1 | 0 | 1 | 1 | no | 2.749662 | 1.294820 | Sara | Alexander | 1.454841
Mean | SD | Median | MAD | Min | Max | n_Obs | Skewness | Kurtosis | percentage_Missing
----------------------------------------------------------------------------------------------
-0.49 | 1.84 | -0.51 | 1.98 | -4.96 | 3.90 | 10000 | 0.02 | -0.64 | 0.00
Variability is the natural tendency for a statistic to be different across different samples.
A biased statistic, on the other hand, misses the mark on average. For example, suppose we increase the value of all our sample means by 1:
Reproducibility and Replicability
Replicating means reaching the same conclusion with slightly different samples, procedures, and data analysis methods.
Related to that is the problem of reproducibility: getting the same results when you simply use the same data and the same methods that were claimed in the analysis.
Shortcomings of statistical analysis *
Statistical inference can handle variability, but not bias. Bias comes from incorrect data collection and only good data collection and management practices can prevent bias.
Statistical inference can lead to wildly wrong conclusions when misused or abused for example:
Sample bias: not choosing the sample randomly.
Undercoverage: your procedure misses out on key portions of the population.
Overgeneralization: you need to limit the conclusions that you draw to the population that you have sampled from.
Significance tests are particularly easy to abuse:
If you run a test just because you’ve spotted a pattern in your data, you’re almost certain to get a positive result even if the trend is due to random chance; patterns can arise by chance alone.
p-hacking: repeated testing of the same hypothesis will eventually yield a positive result just by random chance.
Confusing significance for importance. Just because a difference is significant does not mean it is important.
No amount of statistical savvy can fully remediate bad data. All of the techniques we’ll discuss in this course are designed to address variability in data, not bias. Before attempting any sort of analysis, you should always take some time to think critically about the overall quality of the data. A few questions to consider:
Is the sample representative of the population of interest, or might some subgroups be under or over represented?
How was the data collected? Are any of the variables self-reported? Were there points in the process where those collecting the data made subjective judgments?
How will conclusions from the data analysis be used? How might they be misinterpreted or abused?
2 Expected Value and Standard Error
If we are interested in the average height of adult males in the USA and we extract an adult male at random, we expect his height to be around the population average, give or take about one standard deviation \(\sigma\). The expected value is the population average \(\mu\). The expected value of the sample average is still the population average, but remember that the sample average is a random variable because sampling is a random process, so it won’t be exactly the population mean. How far off can we expect the sample mean to be from the population mean?
The standard error (SE) of a statistic tells you roughly how far off the statistic will be from its expected value. It indicates how much the sample mean is expected to vary from the true population mean:
\[
\text{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}}
\tag{1}\]
where \(n\) is the sample size and \(\sigma\) is the population standard deviation. We can actually use this formula to decide what sample size we need for a desired accuracy in our statistic.
It plays a similar role to the one the standard deviation plays for a single observation around the population mean: the standard deviation measures the dispersion of individual data points around the mean of a dataset, i.e., how spread out the values are. In summary, the standard error of an estimate is the standard deviation of the sampling distribution of that estimate.
Exercise:
A town has 10,000 registered voters, of whom 6,000 are voting for the Democratic party. A survey organization is taking a sample of 100 registered voters (assume sampling with replacement). The percentage of Democratic voters in the sample will be around _____, give or take ____. (You may use the fact that the standard deviation is about 0.5)
The percentage of Democratic voters in the sample is equal to the mean of the sample, and so the “around” value is the expected value of the sample mean, which in turn is equal to the population mean, 0.60 or 60%. Similarly, the “give or take” value is the standard error (SE) of the mean, which is equal to \(\frac{\sigma}{\sqrt{n}}\) , where σ is the standard deviation of the population and n is the size of the sample. We’re told that σ is about 0.5 and n is 100, so the standard error of the mean is 0.05 or 5%.
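The same arithmetic as a quick R sketch (the value 0.5 for \(\sigma\) is taken from the exercise statement):

```r
p <- 6000 / 10000  # population proportion of Democratic voters
sigma <- 0.5       # given in the exercise
n <- 100           # sample size

expected <- p * 100          # expected percentage of Democrats in the sample
se <- sigma / sqrt(n) * 100  # standard error, in percentage points

c(around = expected, give_or_take = se)  # 60 and 5
```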
2.0.1 Expected value and standard error for the sum
If we are interested in the sum of n draws \(S_n\), the sum and the average are related by \(S_n = n\bar{x_n}\) so the standard error of the sum: \[
SE(S_n)=\sqrt{n}\sigma
\tag{2}\]
which tells us that the variability of the sum of n draws increases at the rate of \(\sqrt{n}\) so while increasing the sample size reduces the standard error of the average, it actually increases the standard error for the sum.
Exercise 1 (Expected value):
You solicit 100 pledges for a charitable organization. Each pledge is equally likely to be $10, $50, or $100. You may use the fact that the standard deviation of the three amounts $10, $50 and $100 is $37.
What is the expected value of the sum of the 100 pledges?
The expected value of the sum of a sample is nμ, where n is the size of the sample and μ is the mean of the population. Here we’re told that n is 100 and μ must be the average of $10, $50, and $100, which is $53.33, so the expected value of the sample sum is $5333.
Exercise 2 (Standard error for the sum)
You solicit 100 pledges for a charitable organization. Each pledge is equally likely to be $10, $50, or $100. You may use the fact that the standard deviation of the three amounts $10, $50 and $100 is $37.
What are the chances that the 100 pledges total more than $5,700 ?
The standard error of the sample sum is \(SE(S_n)=\sqrt{n}\sigma\) where n is the sample size and σ is the standard deviation of the population.
Since n is 100 and we’re told that the standard deviation of the population is $37, we know the standard error of the sample sum is $370.
Using the normal approximation to the sampling distribution, this tells us that $5700 is roughly one standard error (5700 − 5333 = 367, close to 370) above the mean of the sampling distribution. From the empirical rule, we know that one half of 68% of the data lies within one standard deviation to the right of the mean, so 50 + (1/2)·68 = 84% of the data lies to the left of $5700. That is, 16% of the data lies to the right of $5700.
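The same calculation can be sketched in R, using pnorm for the normal approximation:

```r
n <- 100
mu <- mean(c(10, 50, 100))  # population mean pledge: about 53.33
sigma <- 37                 # given in the exercise

ev_sum <- n * mu            # expected value of the sum: about 5333
se_sum <- sqrt(n) * sigma   # standard error of the sum: 370

# Chance the total exceeds $5700 under the normal approximation
prob <- 1 - pnorm(5700, mean = ev_sum, sd = se_sum)
prob  # about 0.16
```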
2.0.2 Expected value and standard error for percentages.
For percentages, the expected value is \[E = \mu * 100\% \tag{3}\] and the Standard Error \[SE=\frac{\sigma}{\sqrt{n}} *100\% \tag{4}\]
All the above formulas are for sampling with replacement, but they are still approximately true when sampling without replacement if the sample size is much smaller than the size of the population.
Exercise:
We toss a coin 100 times. How many tails do you expect to see?
We assign tails a value of 1 and heads a value of 0:
p(1) = 0.5, p(0) = 0.5
Number of expected tails = \(E(sum) = 100 * \mu\)
and \(\mu = 0*p(0)+1*p(1)\) , so \(\mu =0.5\) and \(E= 100*0.5\) =50 tails.
The Standard Error will be \(SE(sum) = \sqrt{100}*\sigma\)
to calculate sigma, we know that \(\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}\) where:
(\(\sigma\)) is the standard deviation,
(\(N\)) is the total number of observations,
(\(X_i\)) represents each individual observation,
(\(\mu\)) is the mean of the observations.
In our exercise, without having to toss the coin 100 times, we calculate it like this:
\[
\sigma = \sqrt{(0 - 0.5)^2 \cdot 0.5 + (1 - 0.5)^2 \cdot 0.5} = 0.5
\]
so \(SE(sum) = \sqrt{100} \times 0.5 = 5\), and our answer is that we expect to see 50 tails, give or take 5 tails.
In percentages, we would expect to see E = 0.5 × 100 = 50% tails, with a standard error of SE = 0.5/10 × 100 = 5%.
In R code, here is an example of the same problem with n = 200 tosses. Note the difference in the percentages.
Code
# Define the probabilities
p <- c(0.5, 0.5)
# Define the values (0 for heads, 1 for tails)
x <- c(0, 1)
# Calculate the mean (mu)
mu <- sum(x * p)
# Calculate the variance (sigma^2)
variance <- sum((x - mu)^2 * p)
# Calculate the standard deviation (sigma)
sigma <- sqrt(variance)
# Number of coin tosses
n <- 200
# Calculate the standard error of the sum (SE)
SE <- sqrt(n) * sigma
cat("Expected number of tails:", n * mu, "\n")

Expected number of tails: 100

Code
cat("Standard error:", SE, "\n")

Standard error: 7.071068

Code
cat("Expected percentage of tails:", mu * 100, "%\n")

Expected percentage of tails: 50 %

Code
cat("Standard error in percentage:", sigma / sqrt(n) * 100, "%\n")

Standard error in percentage: 3.535534 %
2.1 The sampling distribution
If we toss a fair coin 100 times, we can get any number of tails from 0 to 100. How likely is each outcome?
The number of tails has the binomial distribution with n=100 and p=0.5 where success = tails. So if the statistic of interest is \(S_n=number\ of \ tails\) , then \(S_n\) is a random variable whose probability histogram is given by the binomial distribution. This is called the sampling distribution of the statistic\(S_n\). The sampling distribution provides more detailed information about the chance properties of \(S_n\) than the summary numbers given by the expected value and the standard error alone.
There are three histograms to take into consideration:
2.1.1 Probability histogram
The probability histogram for producing the data. Since both tails and heads have the same probability, our histogram would be two columns with the same height.
2.1.2 Empirical histogram
If we actually toss the coins, we will have an empirical histogram of the observed outcomes.
2.1.3 Probability histogram of the sample
And finally we can draw the probability histogram of the statistic \(S_{100}\), which shows the sampling distribution:
3 Central limit theorem (CLT)
The Central Limit Theorem is one of the most important results in statistics, basically saying that with sufficiently large sample sizes, the sample average approximately follows a normal distribution. This underscores the importance of the normal distribution, as well as most of the methods commonly used which make assumptions about the data being normally distributed.
Let’s stop and think about what it means for the sample average to have a distribution. Imagine going to the store and buying a bag of your favorite brand of chocolate chip cookies. Suppose the bag has 24 cookies in it. Will each cookie have the exact same number of chocolate chips in it? It turns out that if you make a batch of cookies by adding chips to dough and mixing it really well, then putting the same amount of dough onto a baking sheet, the number of chips per cookie closely follows a Poisson distribution. (In the limiting case of chips having zero volume, this is exactly a Poisson process.) Thus we expect there to be a lot of variability in the number of chips per cookie. We can model the number of chips per cookie with a Poisson distribution. We can also compute the average number of chips per cookie in the bag. For the bag we have, that will be a particular number. But there may be more bags of cookies in the store. Will each of those bags have the same average number of chips? If all of the cookies in the store are from the same industrial-sized batch, each cookie will individually have a Poisson number of chips. So the average number of chips in one bag may be different from the average number of chips in another bag. Thus we could hypothetically find out the average number of chips for each bag in the store. And we could think about what the distribution of these averages is, across the bags in the store, or all the bags of cookies in the world. It is this distribution of averages that the central limit theorem says is approximately a normal distribution, with the same mean as the distribution for the individual cookies, but with a standard deviation that is divided by the square root of the number of samples in each average (i.e., the number of cookies per bag).
When sampling with replacement and \(n\) is large, then the sampling distribution of the sample average approximately follows the normal curve.
For example, going back to the example of the online gambling where we had a probability \(p=0.2\) of winning a small prize, we can see how the probability histogram for the binomial distribution approximates the normal distribution more and more closely as we increase the number of trials \(n\).
That means that we can use normal approximation to compute probabilities. The key point of the theorem is that we know that the sampling distribution of the statistic is normal no matter what the population histogram is.
The mathematical theory behind that:
The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average \(\bar{y}\) of a random sample follows a normal distribution centered at the population average μ and with standard deviation equal to the population standard deviation σ, divided by the square root of the sample size N. We refer to the standard deviation of the distribution of a random variable as the random variable’s standard error. We have seen this formula before when studying the Standard Error (Equation 1): \[
\text{SE} (\bar{x_n})= \frac{\sigma}{\sqrt{n}}
\]
This is telling us that when we take more samples and plot their results in a histogram, the spread of the histogram becomes smaller (closer to the population average) as we saw in the histograms above.
Standardized sample mean
This implies that if we take many samples of size \(N\), then the quantity
\[
Z = \frac{\bar{Y} - \mu}{\sigma_Y / \sqrt{N}}
\]
is approximated with a normal distribution centered at 0 and with standard deviation 1, where:
\(\bar{Y}\): The sample mean, which is the average of your sample data.
\(\mu\): The population mean, which is the average of the entire population from which the sample is drawn.
\(\sigma_Y\): The population standard deviation, which measures the spread of the population data.
\(N\): The sample size, or the number of observations in your sample.
The standardized sample mean is a way to transform the sample mean into a standard normal distribution (mean 0, standard deviation 1). Standardizing the sample mean by subtracting the population mean and dividing by the standard error \(\sigma_Y/\sqrt{N}\) allows us to compare it to the standard normal distribution. This is useful because the properties of the normal distribution are well understood and widely applicable in statistical inference.
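A small simulation sketch of this standardization, using an arbitrary (skewed) exponential population with \(\mu = 1\) and \(\sigma_Y = 1\) as an assumed stand-in:

```r
set.seed(2)
N <- 50
mu <- 1     # mean of the exponential(1) distribution
sigma <- 1  # its standard deviation

# Standardize the mean of many samples of size N
z <- replicate(10000, (mean(rexp(N, rate = 1)) - mu) / (sigma / sqrt(N)))

c(mean = mean(z), sd = sd(z))  # approximately 0 and 1
```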
Central limit theorem in practice. The central limit theorem allows us, when we don’t have access to the population data, to use the normal approximation to compute \(p\)-values.
We are going to use the mice bodyweight data for our example. We will work only with female mice because the bodyweights of female and male mice are different. We calculate the mean and the standard deviation of the control and treatment groups, and those will be the parameters of our population.
The CLT tells us that when the samples are large, the random variable is normally distributed. Thus we can compute \(p\)-values using the function pnorm
Code
dat <- read.csv("data/mice_pheno.csv")
table(dat$Sex, dat$Diet)
Remember that in practice we do not get to compute these population parameters. These are values we never see. In general, we have to estimate them from samples.
As we described, the CLT tells us that for large N, each of these is approximately normal, with mean equal to the population mean and variance equal to the population variance divided by N. We mentioned that a rule of thumb is that N should be 30 or more. However, that is just a rule of thumb, since the preciseness of the approximation depends on the population distribution. Here we can actually check the approximation, and we do that for various values of N.
We are going to create 10,000 samples of N number of mice for N= 3,12,25 and 50 and calculate the difference for the means between treatment and control bodyweight. We then calculate the t-statistic for each sample size (t-statistic is explained later)
Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1. This ratio is what we call the t-statistic.
\[
t = \frac{\bar{x} - \mu}{SE}
\tag{6}\] where \(SE = \frac{s}{\sqrt{n}}\)
In the specific case of our experiment (comparing the means of two samples), this translates into:
\[
t = \frac{\bar{y} - \bar{x}}{\sqrt{\dfrac{s_y^2}{n} + \dfrac{s_x^2}{n}}}
\]
Finally we will plot a QQ plot for each sample size against the normal distribution to see their fit:
Code
Ns <- c(3, 12, 25, 50)
B <- 10000  # number of simulations

## function to compute a t-stat
computetstat <- function(n) {
  y <- sample(hfPopulation, n)
  x <- sample(controlPopulation, n)
  (mean(y) - mean(x)) / sqrt(var(y)/n + var(x)/n)
}

res <- sapply(Ns, function(n) {
  replicate(B, computetstat(n))
})

par(mfrow = c(2, 2))
for (i in seq(along = Ns)) {
  qqnorm(res[, i], main = Ns[i])
  qqline(res[, i], col = 2)
}
So we see that for \(N=3\), the CLT does not provide a usable approximation. For \(N=12\), there is a slight deviation at the higher values, although the approximation appears useful. For 25 and 50, the approximation is good.
This simulation only proves that \(N=12\) is large enough in this case, not in general. The further the population distribution is from the normal distribution, the larger the sample size we will require.
For populations that are already normally distributed, even small sample sizes will result in sample means that are approximately normally distributed. For populations that are not normally distributed, larger sample sizes are required for the sample means to approximate a normal distribution.
We will see later that we don’t usually calculate our \(p\)-values like this using the CLT, instead we will use a t-test that will give us a slightly different result. This is to be expected because our CLT approximation considered the denominator of t-stat practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is a random variable. The smaller the sample size, the more the denominator varies.
When does the central limit theorem apply? For the normal approximation to work, the key requirements are:
we sample with replacement, or we simulate independent random variables from the same distribution. (If the sample size is much smaller than the population size, then sampling with replacement and sampling without replacement is approximately the same so this also applies)
The statistic of interest is a sum, average or percentage.
The sample size is large enough. The more skewed the population histogram is, the larger the required sample size n. If there is no strong skewness, then n=15 is sufficient.
Exercise:
Suppose we are interested in the proportion of times we see a 6 when rolling n=100 dice. This is a random variable which we can simulate with x=sample(1:6, n, replace=TRUE) and the proportion we are interested in can be expressed as an average: mean(x==6). Because the die rolls are independent, the CLT applies.
We want to roll n dice 10,000 times and keep these proportions. This random variable (proportion of 6s) has mean p=1/6 and variance \(\frac{p(1-p)}{n}\). So according to the CLT: \[
z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}
\]
\(z\) should be normal with mean 0 and SD 1.
Perform the simulation, and report what proportion of times \(z\) was larger than 2 in absolute value (CLT says it should be about 0.05).
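One way to carry out this simulation (a sketch; the exact proportion will vary slightly with the random seed):

```r
set.seed(3)
n <- 100    # dice per experiment
p <- 1 / 6  # probability of rolling a 6
B <- 10000  # number of simulated experiments

z <- replicate(B, {
  x <- sample(1:6, n, replace = TRUE)
  (mean(x == 6) - p) / sqrt(p * (1 - p) / n)
})

mean(abs(z) > 2)  # should be close to 0.05
```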
There are two fundamental sorts of questions we might ask our sample data:
How can we estimate a parameter while taking into account the variability of the data in an honest way?
How can we know if our data supports a specific conclusion, given its inherent variability?
The first of these questions leads to the idea of a confidence interval, where we specify not only an estimate of a parameter but also a reasonable margin of error. The latter question gives rise to significance testing.
Use a confidence interval if you wish to estimate a parameter from a sample in a way that describes not only the observed mean but also the uncertainty surrounding it.
Use a significance test if there is one particular value of interest, for instance representing whether a piece of machinery is properly calibrated.
Significance tests and confidence intervals are two sides of the same coin.
Significance tests consider specific individual values, while confidence intervals give more general information about the parameter of interest. The latter are more flexible and are generally easier to interpret. Consider a significance test only when there is a single parameter value of interest which you could identify before looking at the data.
5 Confidence intervals
A confidence interval is a way of reporting an estimate of a parameter that includes information about how much variability could reasonably be expected due to random chance. A confidence interval for the mean of a quantitative variable has the form
\[
\mu = \bar{x} \pm ME
\]
where \(\mu\) is the population mean, \(\bar{x}\) is the observed sample mean and \(ME\) is a margin of error.
5.1 Manual calculation:
A confidence interval gives a range of plausible values for a population parameter. Usually the confidence interval is centered at an estimate for \(\mu\) which is an average. Since the central limit theorem applies for averages, the confidence interval has a simple form:
\[
CI = estimate \pm ME
\]
where ME is the margin of error and can be decomposed like this:
\[
ME = z \times SE
\]
where SE is the standard error of the statistic. If that is an average or a percentage then it is \(SE= \frac{\sigma}{\sqrt{n}}\) as we saw in (Equation 1).
And \(z\) is the \(z\)-score corresponding to the desired confidence level. For example, if we would like to have a 95% confidence level, our \(z=1.96\)
This comes from calculating the value of \(z\) such that 95% of our data is in the middle:
To get that \(z\) value we use tables, or we can use the quantile function qnorm:
Code
# Calculate the z-score for a 95% confidence interval
z_score <- qnorm(0.975)
z_score
[1] 1.959964
The 0.975 value is used because for a 95% confidence interval you need to capture the central 95% of the distribution, leaving 2.5% in each tail. Therefore, you look up the 97.5th percentile (0.975) to get the \(z\)-score.
For a 90% confidence interval we are looking for the \(z\) value in the standardized distribution where 90% of the data falls in our central range and 10% outside, so 5% in each tail.
Code
# Calculate the z-score for a 90% confidence interval
desiredConfidence <- 90
tails <- (100 - desiredConfidence) / 2
percentileOfinterest <- desiredConfidence + tails
z_score <- qnorm(percentileOfinterest / 100)
z_score
[1] 1.644854
For a 99% confidence level:
Code
# Calculate the z-score for a 99% confidence interval
desiredConfidence <- 99
tails <- (100 - desiredConfidence) / 2
percentileOfinterest <- desiredConfidence + tails
z_score <- qnorm(percentileOfinterest / 100)
z_score
[1] 2.575829
Confidence Intervals for Coin Flips
Example 1: Flipping a Coin 100 Times
In this example of flipping a coin 100 times, we observed 44 heads, resulting in the following 95% confidence interval for \(p\):
\[
CI= \hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
\]
where:

- \(\hat{p}\) is the observed proportion of heads, here \(\hat{p} = 44/100 = 0.44\)
- \(z\) is the critical value from the standard normal distribution for a 95% confidence level (approximately 1.96)
- \(n\) is the number of trials (100 in this case)

Plugging in these values gives an interval of approximately \([0.343, 0.537]\).
From this, we concluded that it is plausible that the coin may be fair because \(p = 0.5\) is in the interval.
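The interval itself can be computed in R as a short sketch of the formula above:

```r
n <- 100
x <- 44            # observed heads
p_hat <- x / n
z <- qnorm(0.975)  # about 1.96

me <- z * sqrt(p_hat * (1 - p_hat) / n)
c(lower = p_hat - me, upper = p_hat + me)  # about 0.343 to 0.537
```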
Example 2: Flipping a Coin 100,000 Times
Suppose instead that we flipped the coin 100,000 times, observing 44,000 heads (the same percentage of heads as before). Then using the method just presented, the 95% confidence interval for \(p\) is:
Code
# Example 2: Flipping a Coin 100,000 Times
n2 <- 100000
x2 <- 44000
p_hat2 <- x2 / n2
z <- qnorm(0.975)
# Confidence interval calculation
(ci_lower2 <- p_hat2 - z * sqrt(p_hat2 * (1 - p_hat2) / n2))
(ci_upper2 <- p_hat2 + z * sqrt(p_hat2 * (1 - p_hat2) / n2))
Is it reasonable to conclude that this is a fair coin with 95% confidence? Based on this narrower confidence interval, it appears less likely that the coin is fair since \(p = 0.5\) is not within the interval.
Estimating the SE with bootstrap principle
We still have a problem when calculating the confidence interval this way: we need to know the standard deviation of the population \(\sigma\), and we usually don’t know it. But the bootstrap principle states that we can replace \(\sigma\) by its sample version \(s\) and still get an approximately correct confidence interval.
Example: We poll 1000 likely voters and find that 58% approve of the way the president handles his job.
\(SE= \frac{\sigma}{\sqrt{n}} \times 100\), where \(\sigma = \sqrt{p(1-p)}\) and \(p\) is the proportion of all voters who approve. We don’t know \(p\), but the bootstrap principle tells us that we can replace \(\sigma\) by \(s\), so we can plug in the value from our survey: \(s = \sqrt{0.58(1-0.58)} = 0.49\)
So a 95% confidence interval for \(p\) is: \[
58\% \pm 1.96 \frac{0.49}{\sqrt{1000}} \times 100
\]
which is approximately [54.9%,61.1%]
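The same plug-in calculation as a short R sketch:

```r
n <- 1000
p_hat <- 0.58                   # sample proportion who approve
s <- sqrt(p_hat * (1 - p_hat))  # bootstrap plug-in for sigma: about 0.49

se_pct <- s / sqrt(n) * 100     # standard error in percentage points
me <- qnorm(0.975) * se_pct     # margin of error: about 3.1 points

c(lower = 58 - me, upper = 58 + me)  # roughly 55% to 61%
```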
The width of the confidence interval is determined by \(z\) and the standard error SE. To reduce the margin of error we have two options: increase the sample size or decrease the confidence level. The sample size appears under a square root, so to cut the width of the confidence interval in half we need four times the sample size, and to reduce it by a factor of 10 we would need 100 times the sample size.
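A quick check of this square-root relationship, reusing the plug-in standard deviation 0.49 from the polling example:

```r
# Margin of error (in percentage points) for a 95% CI as a function of n
margin <- function(n) qnorm(0.975) * 0.49 / sqrt(n) * 100

margin(1000)                 # about 3.0
margin(4000)                 # about 1.5
margin(1000) / margin(4000)  # exactly 2: quadrupling n halves the width
```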
5.2 Getting the confidence intervals from a test result
Now we can calculate the confidence intervals for the 90% and 99% confidence levels and see their differences in a graph:
Code
# level of confidence 90%
testResult <- t.test(optical_sample$eye_difference, conf.level = .90)
confInt90_low <- testResult$conf.int[1]
confInt90_upp <- testResult$conf.int[2]
# level of confidence 99%
testResult <- t.test(optical_sample$eye_difference, conf.level = .99)
confInt99_low <- testResult$conf.int[1]
confInt99_upp <- testResult$conf.int[2]
ggplot(optical_sample, aes(x = eye_difference)) +
  geom_histogram() +
  geom_vline(xintercept = mean(optical_sample$eye_difference), linetype = "dashed") +
  geom_vline(xintercept = confInt90_low, linetype = "dotted", color = "orange") +
  geom_vline(xintercept = confInt90_upp, linetype = "dotted", color = "orange") +
  geom_vline(xintercept = confInt95_low, linetype = "dotted", color = "green") +
  geom_vline(xintercept = confInt95_upp, linetype = "dotted", color = "green") +
  geom_vline(xintercept = confInt99_low, linetype = "dotted", color = "yellow") +
  geom_vline(xintercept = confInt99_upp, linetype = "dotted", color = "yellow")
Three factors feed into the size of the margin of error:
The confidence level of the interval. That is, the proportion of the time our interval would correctly capture the parameter of interest. Higher confidence requires a larger margin of error.
The spread of the observations in the data set. More spread in the data implies more sampling variability and therefore a larger margin of error.
The size of the sample.
Exercises:
Using the pm_sample data set,
Find a 95% confidence interval for the mean salary of the project managers in this population. Interpret the results in plain language. Would a 99% confidence interval be wider or more narrow?
Find a 95% confidence interval for the mean non-salary compensation of project managers in this population. Why is the margin of error different than in the first problem?
Mean | SD | Median | MAD | Min | Max | n_Obs | Skewness | Kurtosis | percentage_Missing
------------------------------------------------------------------------------------------------------------------
87653.22 | 32751.30 | 82000.00 | 24907.68 | 28000.00 | 2.80e+05 | 221 | 2.02 | 7.01 | 0.00
Code
t.test(pm_survey$annual_salary)
One Sample t-test
data: pm_survey$annual_salary
t = 39.786, df = 220, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
83311.35 91995.08
sample estimates:
mean of x
87653.22
The confidence interval at the (default) 95% level is [83311.35, 91995.08], which means in plain language that if we were to repeat the sampling and the test many times, this kind of confidence interval would capture the true value of the parameter 95% of the time. Although not strictly true, we can loosely say that there’s a 95% chance that this confidence interval includes the true population mean.
Mean | SD | Median | MAD | Min | Max | n_Obs | Skewness | Kurtosis | percentage_Missing
----------------------------------------------------------------------------------------------------------
5353.81 | 9676.19 | 1500.00 | 2223.90 | 0.00 | 70000.00 | 221 | 3.32 | 13.78 | 0.00
Code
t.test(pm_survey$other_monetary_comp)
One Sample t-test
data: pm_survey$other_monetary_comp
t = 8.2253, df = 220, p-value = 0.00000000000001693
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
4071.026 6636.585
sample estimates:
mean of x
5353.805
Our confidence interval is [4071.03, 6636.59], so the margins of error in the two examples are about ±4342 for annual salary and about ±1283 for other monetary compensation.
The difference between the two is due to the variability of the data, which we can measure using the standard deviation:
Code
sd(pm_survey$annual_salary)
[1] 32751.3
Code
sd(pm_survey$other_monetary_comp)
[1] 9676.192
5.2.1 Interpreting confidence intervals: practical example
We are going to calculate the confidence interval for 100 samples of 30 observations calculated over the same population and plot it on a graph with a line marking the true mean. We will see how some of them (for a 95% confidence level it should be around 5% of the times) will still miss the population parameter:
Code
# calculate the sample mean and confidence interval for 100 samples of 30 observations
low <- numeric()
high <- numeric()
for (n in 1:100) {
  sample <- slice_sample(optical, n = 30)
  test <- t.test(sample$eye_difference)
  low[n] <- test$conf.int[1]
  high[n] <- test$conf.int[2]
}
ci_reps <- data.frame("replicate" = 1:100, low, high)
ggplot(ci_reps, aes(x = low, xend = high, y = replicate, yend = replicate)) +
  geom_segment() +
  geom_vline(xintercept = mean(optical$eye_difference), linetype = "dashed")
When interpreting a confidence interval, remember that not all values in an interval are equal. The center is always more likely than the ends.
6 Significance testing
A t-significance test is used to consider whether an individual parameter value of interest is plausible in light of sample data.
The situation in which the parameter has that value is referred to as the null hypothesis, and the situation in which it does not is referred to as the alternative hypothesis. For example: could the difference between left eye value and right eye value actually be zero? Does a certain large company pay better than the national average, or could salaries there just differ through random chance?
A test statistic measures how far away the data in our sample are from what we would expect if the null hypothesis \(H_0\) were true.
The most common test statistic is the \(z\)-statistic. It measures how far an observed value is from the expected value, in units of standard errors. We already saw this formula before (?@eq-zscore):
\[
z= \frac{observed - expected}{SE}
\tag{8}\]
Observed is a statistic that is appropriate for assessing \(H_0\). For example, if we toss a coin ten times and want to know if it is a fair coin, the appropriate statistic would be the number of tails or the percentage of tails.
Expected and SE are the expected value and the SE of this statistic computed under the assumption that \(H_0\) is true.
Example:
We toss a coin 10 times and get 7 tails. Using the formulas for expected values of sums, the number of tails we expect if the coin is fair is:
For a binomial distribution we have \(E = n p\), where \(n\) is the number of repetitions and \(p\) is the probability of success (0.5), and the standard error for binomial scenarios is \(SE = \sqrt{np(1-p)}\).
\[
expected = 10 * \frac{1}{2} = 5
\]
\[
SE = \sqrt{10}\sqrt{\frac{1}{2} * \frac{1}{2}} =1.58
\]
\[
z= \frac{7-5}{1.58}=1.27
\]
By the central limit theorem, the \(p\)-value can be computed with normal approximation.
Code
z <- 1.27
# Calculate the two-tailed p-value from the normal approximation
(p_value <- (1 - pnorm(z)) * 2)
[1] 0.2040846
In other words, if the coin is fair, there is roughly a 20.4% probability of observing a result at least as extreme as 7 tails in 10 tosses.
If the \(z\)-value is large, that means that there is a great difference between the observed and the expected, so large values of \(z\) are evidence against \(H_0\). The strength of the evidence is measured by the \(p\)-value or observed significance level.
The \(p\)-value is the probability of getting a value of \(z\) as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. As we can see in the graph below, as \(z\) becomes larger, there is less probability of finding data outside its limits, so the \(p\)-value will be smaller. We multiply the \(p\)-value by 2 because we have to consider both tails of the distribution for this experiment.
The \(z\)-score tells you how far an observed value is from the expected value in terms of standard errors. The \(p\)-value tells you the probability of observing a test statistic as extreme as, or more extreme than, the observed statistic, under the null hypothesis.
The result of a significance test is a \(p\)-value, which measures how plausible the value of interest is given the sample data. A \(p\)-value close to zero indicates the value isn't compatible with the data. As a general rule, if \(p < 0.05\), the value can be considered questionable.
Ideally, you should identify a value of interest (the technical term is null hypothesis) before collecting the data. While this isn't always done in practice, doing so helps prevent the wrong conclusions that come from the shotgun effect.
If you run a test because you observed an interesting pattern in your sample data, the overall chance of encountering a significant result increases. Data is like firing a shotgun with many pellets: some are bound to hit the target purely by chance. If you create your null hypothesis based on a characteristic you observed in your sample, you are likely to find significance even if that characteristic is not present in the overall population and appeared in your sample just by chance.
While a \(p\)-value might tell you that a parameter is different from a hypothesized value, it won’t ever say how different it might be or if that difference is important.
A t-significance test is appropriate for quantitative data under the same circumstances as a t-confidence interval: whenever the sample is not both very small and highly non-symmetric. If the data is relatively symmetric and there are no extreme outliers, n = 10 is usually enough; regardless of the shape of the data, n = 30 is a safe threshold.
The value of \(\mu\) (mu) is the value we want to test against (our null hypothesis)
Code
file1 <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file1)
testResult <- t.test(optical_sample$eye_difference, mu = 0)
testResult$p.value
[1] 0.005840652
The \(p\)-value expresses the probability of observing values like those in our sample if the population eye difference were in fact 0. The smaller the \(p\)-value, the smaller that probability. Usually we take the threshold of 5% (\(p\)-value < 0.05) to reject the null hypothesis, as it means it is unlikely that the mean observed in our sample came from a population whose true mean is 0.
We can change the null hypothesis to whatever value we want to test against. For example, can we say that the average age of the population is NOT 59?
Code
testResult <- t.test(optical_sample$Age, mu = 59)
testResult$p.value
[1] 0.1078985
Our \(p\)-value is 0.1078985, which does not allow us to reject the null hypothesis.
Quick Facts
A \(p\)-value below 0.05 doesn't mean there's a 5% chance the null hypothesis is true. Instead, it means that if the null were true, there would be less than a 5% chance of observing data as extreme as, or more extreme than, what was observed in the sample.
Statistical power, the ability to detect a true effect, is often overlooked. A study can have a high chance of missing real effects (a Type II error) even if its \(p\)-value threshold is stringent.
In proportional analysis, Simpson’s Paradox can occur where a trend seen in several groups reverses when the groups are combined. It underscores the importance of scrutinizing aggregated data.
6.0.1 One-sided vs two-sided tests
One has to carefully consider whether the alternative hypothesis should be two-sided (we consider the values to the left of \(-z\) and to the right of \(z\)) or one-sided (only the values on one side). For a two-sided test, the \(p\)-value gets doubled. It is not OK to change the alternative afterwards in order to get the \(p\)-value under 5%.
Deciding whether to use a one-sided or two-sided t-test depends on the hypothesis you want to test. Here are some guidelines to help you make that decision:
Use a one-sided t-test when you have a specific direction in mind for your hypothesis. This means you are testing whether the mean is either greater than or less than a certain value, but not both.
Examples:
Greater than: You want to test if the mean score of a new teaching method is greater than the mean score of the traditional method.
Less than: You want to test if the mean time to complete a task using a new software is less than the mean time using the old software.
Use a two-sided t-test when you are interested in any difference from the specified value, regardless of direction. This means you are testing whether the mean is different from a certain value, without specifying the direction of the difference.
Examples:
You want to test if the mean weight of a sample of apples is different from a known standard weight.
You want to test if the mean score of a new drug is different from the mean score of a placebo.
How to Decide
Define Your Research Question: Clearly state what you are trying to find out.
Determine the Direction of Interest: Decide if you are only interested in deviations in one direction (one-sided) or in both directions (two-sided).
Formulate Your Hypotheses: Based on your research question and direction of interest, formulate your null and alternative hypotheses.
Example Scenario
Suppose you are testing a new drug and want to know if it has a different effect on blood pressure compared to a placebo. If you only care whether the drug lowers blood pressure, you would use a one-sided test. If you care whether the drug either lowers or raises blood pressure, you would use a two-sided test.
Code
# Sample data
sample_data <- c(78, 82, 85, 90, 76, 79, 81, 77, 74, 88)
# Perform one-sided t-test
t_test_result <- t.test(sample_data, mu = 75, alternative = "greater")
# Print the result
print(t_test_result)
One Sample t-test
data: sample_data
t = 3.6, df = 9, p-value = 0.002874
alternative hypothesis: true mean is greater than 75
95 percent confidence interval:
77.94481 Inf
sample estimates:
mean of x
81
Code
# Perform two-sided t-test
t_test_result <- t.test(sample_data, mu = 75, alternative = "two.sided")
# Print the result
print(t_test_result)
One Sample t-test
data: sample_data
t = 3.6, df = 9, p-value = 0.005748
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
77.22974 84.77026
sample estimates:
mean of x
81
6.0.2 Common errors in significance testing
At the start of a significance test, the hypothesized value might be true or false. At the end, it may be rejected or not, but we still don't know for sure whether \(H_0\) is true. If the null hypothesis holds, we say that it is a true null hypothesis; otherwise it is a false null hypothesis. Altogether, there are four possible combinations of these outcomes:

|                         | \(H_0\) is true               | \(H_0\) is false                |
|-------------------------|-------------------------------|---------------------------------|
| \(H_0\) is rejected     | Type I error (false positive) | True positive                   |
| \(H_0\) is not rejected | True negative                 | Type II error (false negative)  |
We call the event of rejecting the null hypothesis when it is in fact true a Type I error; we call the probability of making a Type I error the Type I error rate; and we say that rejecting the null hypothesis when the \(p\)-value is less than \(\alpha\) controls the Type I error rate so that it is equal to \(\alpha\).
We may observe a \(p\)-value higher than \(\alpha\) even when the null hypothesis is false. This is a Type II error. Here we ask: how big does \(n\) have to be in order to detect that the absolute value of the difference is greater than zero? Type II error control plays a major role in designing data collection procedures before you actually see the data, so that you know the test you will run has enough sensitivity, or power. Power is 1 minus the Type II error rate: the probability that you will reject the null hypothesis when the alternative hypothesis is true.
There are several aspects of a hypothesis test that affect its power for a particular effect size. Intuitively, setting a lower \(\alpha\) (for example, moving from 0.05 to 0.01) decreases the power of the test for a given effect size, because the null hypothesis becomes more difficult to reject. This means that for an experiment with fixed parameters (i.e., a predetermined sample size, recording mechanism, etc.), the power of the hypothesis test trades off with its Type I error rate, no matter what effect size you target.
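Base R's `power.t.test` makes this trade-off concrete. A minimal sketch, where the effect size `delta` and `sd` are illustrative assumptions rather than values from the data:

```r
# Power of a one-sample t-test under assumed (illustrative) conditions:
# detect a true mean shift of delta = 3 with sd = 4, using n = 12 observations
pw_05 <- power.t.test(n = 12, delta = 3, sd = 4, sig.level = 0.05,
                      type = "one.sample")$power
pw_01 <- power.t.test(n = 12, delta = 3, sd = 4, sig.level = 0.01,
                      type = "one.sample")$power
c(alpha_0.05 = pw_05, alpha_0.01 = pw_01)  # lowering alpha lowers power
```

The same function can be solved for `n` instead, which is how a preliminary power analysis is usually run.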
::: {.callout-orange appearance="simple" icon="false"}
Testing a Hypothesis
Conducting a hypothesis test typically proceeds in four steps. First, we define the null and alternative hypotheses. Next, we construct a test statistic that summarizes the strength of evidence against the null hypothesis. We then compute a \(p\)-value that quantifies the probability of having obtained a comparable or more extreme value of the test statistic under the null hypothesis. Finally, based on the \(p\)-value, we decide whether to reject the null hypothesis.
Step 1: Define the Null and Alternative Hypotheses.
The null hypothesis \(H_0\) is the default state of belief about the world. The alternative hypothesis \(H_a\) represents something different. We can think of rejecting the null hypothesis as making a discovery about our data; but if we fail to reject it, we do not know whether that is because our sample was too small or because the null hypothesis holds.
Step 2. Construct the Test Statistic
The way in which we construct our test statistic T depends on the nature of the null hypothesis that we are testing.
Step 3. Compute the \(p\)-value
The \(p\)-value is the area under the density function of the null distribution that falls to the right of the \(T\) statistic we calculated in the previous step; it gives us the probability of observing a value as large as or larger than the \(T\)-statistic under the assumption that the null hypothesis holds. In general, most commonly used test statistics follow a well-known distribution under the null hypothesis, such as the normal distribution, the t-distribution, a \(\chi^2\) distribution, or an \(F\) distribution, provided that the sample size is sufficiently large and some other assumptions hold.
The \(p\)-value allows us to transform our test statistic, which is measured on some arbitrary and uninterpretable scale, into a number between 0 and 1 that can be easily interpreted.
Step 4. Decide whether to reject the null hypothesis.
Once we have computed the \(p\)-value, it remains for us to decide whether or not to reject the null hypothesis. If the \(p\)-value is sufficiently small, we reject it; but deciding what counts as sufficiently small is up to us.
:::
6.0.3 Statistical power
Statistical power is the ability of a test to detect a specified effect. A test with low power is unlikely to rule out a hypothesized value, even if the value is false. That is, it has a high probability of type II error. Increasing the power of an inference technique requires a trade-off of one sort or another. In practice, this usually means using a larger sample. A preliminary power analysis can give decision-makers information about the study size needed to detect an effect of interest.
A very simple example of how to do that: study planners can estimate the sample size needed to reduce the margin of error in the estimate of a population proportion using this formula: \[
n = \left( \frac{1.96}{2E} \right)^2
\]
where \(E\) is the required margin of error. According to this formula, a simple political poll requiring a 3% margin of error would need a sample size of approximately n=1067 respondents.
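The arithmetic can be checked directly:

```r
# Sample size needed for a 3% margin of error in a simple poll
E <- 0.03
n <- (1.96 / (2 * E))^2
round(n)   # approximately 1067 respondents
```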
We are going to use our mice dataset to see how we can calculate the power of our t-test against Type II errors. We treat the data in the file as our entire population. If we look at the difference in average weight for control vs treatment, we see that there is in fact a difference of around 9%:
Depending on our sample size, this difference will be perceived by our test or not. Let’s see an example with sample size = 12 and alpha = 0.05. We will do first a single test and see that we cannot reject the null hypothesis based on the \(p\)-value:
In this case we see that only 25% of the time did we correctly reject the null hypothesis; this is the power of our test. To increase this percentage we should increase the sample size. In the next code section we run the same simulation for various values of \(n\).
There are several tools on the internet that allow you to calculate the power you get for a particular standard deviation, sample size \(n\), or effect size you want to detect.
When performing a hypothesis test over a large number of data points, the sample size can significantly impact the \(p\)-value.
Influence of Sample Size on the p-value:
Increased Sensitivity: A larger sample size increases the power of the test, meaning it’s more likely to detect a true effect if one exists. This is because with more data, we get a clearer picture of the population, reducing the standard error.
Reduced Standard Error: The standard error of the mean decreases as the sample size increases. This makes the test statistic (such as the \(t\)-score or \(z\)-score) larger if there is an effect, which can lead to a smaller \(p\)-value.
Smaller p-values: With a large sample size, even small differences from the null hypothesis can result in statistically significant \(p\)-values. This is because the test becomes more sensitive to detecting small effects.
Risk of Overinterpretation: While small \(p\)-values indicate statistical significance, they do not necessarily imply practical significance. In large datasets, it’s crucial to consider the effect size and practical relevance of the results.
Example: Consider a hypothesis test to determine whether a new drug has a different effect than a placebo. If you have a very large sample size, even a tiny difference in outcomes between the drug and placebo groups may produce a statistically significant \(p\)-value, even if the difference is not clinically meaningful.
Visual Explanation: Imagine plotting the p-value as a function of sample size. Initially, as the sample size increases, the p-value decreases sharply, indicating stronger evidence against the null hypothesis. However, beyond a certain point, the \(p\)-value becomes very small for even minute differences, highlighting the need to interpret results in context.
Large Sample Size: Leads to smaller \(p\)-values, higher test sensitivity, and potential for detecting trivial differences as significant.
Interpret Results Carefully: Always consider effect size and practical significance, not just statistical significance.
Code
# Set seed for reproducibility
set.seed(123)

# Generate synthetic data
small_sample <- rnorm(30, mean = 0.03, sd = 1)
large_sample <- rnorm(2000, mean = 0.03, sd = 1)

# Function to calculate p-values for increasing sample sizes
calculate_p_values <- function(data, max_size) {
  p_values <- vector("numeric", max_size)
  for (i in 2:max_size) { # Start from 2 to avoid t-test error
    sample_data <- data[1:i]
    test_result <- t.test(sample_data, mu = 0)
    p_values[i] <- test_result$p.value
  }
  return(p_values)
}

# Calculate p-values for both small and large samples
small_sample_p_values <- calculate_p_values(small_sample, length(small_sample))
large_sample_p_values <- calculate_p_values(large_sample, length(large_sample))

# Combine results into a data frame
p_values_df <- data.frame(
  Sample_Size = c(1:length(small_sample), 1:length(large_sample)),
  P_Value = c(small_sample_p_values, large_sample_p_values),
  Sample_Type = rep(c("Small_Sample", "Large_Sample"),
                    c(length(small_sample), length(large_sample)))
)

# Filter out NA values
p_values_df <- p_values_df %>% filter(!is.na(P_Value))

# Plot the p-values with facets
ggplot(p_values_df, aes(x = Sample_Size, y = P_Value, color = Sample_Type)) +
  geom_line() +
  labs(title = "Influence of Sample Size on P-Value",
       x = "Sample Size",
       y = "P-Value",
       color = "Sample Type") +
  theme_minimal() +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
  scale_color_manual(values = c("Small_Sample" = "blue", "Large_Sample" = "green")) +
  facet_wrap(~ Sample_Type, scales = "free_x", ncol = 1) +
  ylim(0, 1)
The graphs above show how even a tiny deviation (our data's \(\mu = 0.03\)) from the null hypothesis (\(\mu = 0\)) shows up as significant once the sample exceeds roughly 1,500 data points, while the panel limited to a maximum of 30 data points never detects the same difference as significant.
6.0.4 Student's t-test
So far we have been using the t-test without explaining the Student's t-distribution that works under the hood. Let's explain it now.
We will explain the Student's t-test with an example: the health guideline for lead in drinking water is a concentration of not more than 15 parts per billion (ppb). Five independent samples from a reservoir average 15.6 ppb. Is this sufficient evidence to conclude that the concentration \(\mu\) in the reservoir is above the standard of 15 ppb?
Our null hypothesis is that it is not: the concentration is right at the standard, \(H_0: \mu = 15\), and the alternative hypothesis is that the concentration is higher than 15 ppb.
The SE of the average is \(\frac{\sigma}{\sqrt{n}}\), but \(\sigma\) is unknown. By the bootstrap principle we should be able to substitute the population standard deviation \(\sigma\) with the standard deviation of our sample, \(s\). However, for sample sizes of 20 or fewer, the normal curve is not a good enough approximation to the distribution of the \(z\)-statistic; the appropriate approximation is instead the Student's t-distribution with \(n-1\) degrees of freedom.
In the graph we can see that the purple line, where the degrees of freedom are high (large sample), is just the normal curve. The other lines show how, with fewer degrees of freedom, the tails of the curve get heavier, representing higher uncertainty. We saw the sample standard deviation formula earlier (?@eq-sampleStandarDeviation):
\[
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2}
\]
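As a quick sanity check, this \(n-1\) formula is exactly what R's `sd()` computes (the data vector is made up for illustration):

```r
# Check that R's sd() uses the n-1 (sample) formula above
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
s_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
all.equal(s_manual, sd(x))   # TRUE
```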
In this case we also need to adjust our confidence interval formula from \(CI = estimate \pm z\, SE\)
to \[
CI = estimate \pm t_{n-1}\, SE
\tag{9}\]
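A minimal sketch verifying that this formula reproduces the interval `t.test()` reports (the data vector is made up for illustration):

```r
# Manual t-based confidence interval, checked against t.test()
x <- c(12.1, 14.3, 11.8, 13.5, 12.9, 14.1, 13.2, 12.6)
n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical t value for a 95% interval
ci_manual <- mean(x) + c(-1, 1) * t_crit * se
ci_ttest <- as.numeric(t.test(x)$conf.int)
all.equal(ci_manual, ci_ttest)     # TRUE
```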
z-test vs t-test
The main differences between a t-test and a \(z\)-test lie in their assumptions and applications:
T-Test
Distribution: Uses the Student’s t-distribution.
Population Variance: Unknown and estimated from the sample.
Sample Size: Typically used for small sample sizes (n < 30).
Degrees of Freedom: Required for the calculation.
Application: Used when comparing the means of two groups, especially when the sample size is small and the population variance is unknown.
Z-Test
Distribution: Uses the standard normal distribution (z-distribution).
Population Variance: Known.
Sample Size: Typically used for large sample sizes (n > 30).
Degrees of Freedom: Not required.
Application: Used for hypothesis testing of means and proportions when the sample size is large and the population variance is known.
Key Differences
Distribution:
T-test: Student’s t-distribution.
\(z\)-test: Standard normal distribution.
Population Variance:
T-test: Unknown and estimated from the sample.
\(z\)-test: Known.
Sample Size:
T-test: Small sample sizes (n < 30).
\(z\)-test: Large sample sizes (n > 30).
Degrees of Freedom:
T-test: Required.
\(z\)-test: Not required.
Example Scenarios
T-Test: Comparing the average test scores of two small classes where the population variance is unknown.
Z-Test: Testing the average height of a large population where the population variance is known.
t-tests are the most commonly used tests in real life, but if we meet all the criteria to perform a \(z\)-test we can do it like this in R:
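The code chunk that produced the output below is not shown; a base-R sketch of a one-sample z-test looks like this (the data, `mu0`, and `sigma` here are illustrative assumptions, not the values behind the output below):

```r
# One-sample z-test in base R; the population sd sigma is assumed known
sample_data <- c(98, 105, 110, 101, 95, 107, 103, 99)  # illustrative data
mu0 <- 100      # hypothesized mean
sigma <- 5      # known population standard deviation (assumed)
n <- length(sample_data)
z <- (mean(sample_data) - mu0) / (sigma / sqrt(n))
p_value <- 2 * pnorm(-abs(z))   # two-sided p-value
c(z = z, p_value = p_value)
```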
One-sample z-Test
data: sample_data
z = 0.90933, p-value = 0.3632
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
96.47608 109.62392
sample estimates:
mean of x
103.05
7 Categorical data
In 1912 the Titanic sank and more than 1500 of the 2229 people on board died. Did the chances of survival depend on the ticket class?
|          | First | Second | Third | Crew |
|----------|-------|--------|-------|------|
| Survived | 202   | 118    | 178   | 215  |
| Died     | 123   | 167    | 528   | 698  |
This is an example of categorical data. The data are counts for a fixed number of categories. Here the data is tabulated in a 2x4 table. Such a table is called a contingency table because it shows the survival counts for each category of ticket class, i.e., contingent on ticket class.
7.1 Confidence Intervals for Proportions.
Confidence intervals for proportions provide a range of plausible values for population proportions, enabling estimation with a desired level of confidence.
Proportional analysis techniques allow for the assessment and comparison of proportions across different categories, providing valuable insights into categorical data.
There are any number of statistical methods for computing confidence intervals for proportions. By far the simplest is the normal approximation:
\[
\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]
where \(\hat{p}\) is the observed sample proportion, used as an estimate of the actual population proportion \(p\), and \(n\) is the sample size.
As long as the sample size isn't too small, the difference between methods should be minor. If your sample includes at least 5 of each possible outcome, you are fine with whatever default method your software uses to calculate the proportions.
In R we will use prop.test(n_s, sample_size), where n_s is the number of successes.
Practice:
We are going to calculate the proportion of smokers in our population based on our optical_sample data
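The call behind the output below takes the number of smokers and the sample size; 48 smokers out of 100 is an assumption consistent with the estimate p = 0.48 reported in the output:

```r
# prop.test(successes, sample size)
# 48 out of 100 is assumed here to match the reported estimate p = 0.48
testResult <- prop.test(48, 100)
testResult$estimate   # 0.48
testResult$conf.int
```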
1-sample proportions test with continuity correction
data: successNumber out of sampleSize, null probability 0.5
X-squared = 0.09, df = 1, p-value = 0.7642
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3798722 0.5816817
sample estimates:
p
0.48
Code
#proportion of success proportion<- testResult$estimate %>%print()
We can use the formula we studied at the beginning of the chapter to manually calculate the confidence intervals as well and see the difference with what the software used:
1-sample proportions test with continuity correction
data: numberOfWomen out of sampleSize, null probability 0.5
X-squared = 91.24, df = 1, p-value < 0.00000000000000022
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.7653993 0.8701140
sample estimates:
p
0.8235294
7.1.2 Few observations for each outcome
While there are statistical methods for dealing with samples that have fewer than 5 of each outcome, you should be conservative about making decisions based on them. That said, two methods might be helpful in such situations:
The Wilson score interval works well even for observed proportions near zero or one and has other good theoretical properties as well. Unlike the other confidence intervals we've seen, it isn't symmetric.
The rule of three is a great rough-and-ready way to compute a 95% confidence interval for a proportion when every observation had the same outcome. If you have observed failures only, the interval goes from 0 to 3/n, where n is the sample size; if you have observed successes only, it goes from 1 - (3/n) to 1.
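A sketch of the rule of three for a hypothetical sample of n = 50:

```r
# Rule of three: 95% CI when every one of n observations had the same outcome
n <- 50
failures_only <- c(0, 3 / n)       # e.g. 0 successes in 50 trials
successes_only <- c(1 - 3 / n, 1)  # e.g. 50 successes in 50 trials
failures_only
successes_only
```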
7.2 Significance testing for proportions
We want to know whether the proportion of people taking medication in our sample is the same as the proportion of people in the USA taking medication. We know this figure is 66% of the USA population, so we can plug that value into the test by setting \(p = 0.66\), where \(p\) is the proportion under our null hypothesis.
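A sketch of that test; the counts here are illustrative assumptions, so substitute the actual counts from the optical sample:

```r
# Testing against the national proportion p = 0.66
takingMedication <- 45   # hypothetical number taking medication (assumption)
sampleSize <- 100        # hypothetical sample size (assumption)
testResult <- prop.test(takingMedication, sampleSize, p = 0.66)
testResult$p.value
```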
In our case we can reject the null hypothesis and say that it is not reasonable to believe that this sample was drawn from a population in which the true proportion is 66%.
Exercises
Exercise 1. A media outlet claims that 60% of project managers in the U.S have a college degree as their highest level of education. Is that claim plausible using our pm_survey data set?
The confidence interval shows that between 44% and 57% of respondents are under 35
7.3 Goodness of Fit
Consider genetic data where you have two genotype groups (A or B) for cases and controls of a given disease. The statistical question is whether genotype and disease are associated. As in the examples we have studied previously, we have two populations (A and B) and numeric data for each, where disease status is coded as 0 or 1. So why can't we perform a t-test? Note that the data is either 0 (control) or 1 (cases). It is pretty clear that this data is not normally distributed, so the t-distribution approximation is certainly out of the question. We could rely on the CLT if the sample size is large enough; otherwise, we can use association tests.
Imagine we have 250 individuals, where some of them have a given disease and the rest do not. We observe that 20% of the individuals that are homozygous for the minor allele (group B) have the disease compared to 10% of the rest. Would we see this again if we picked another 250 individuals?
Code
disease <- factor(c(rep(0, 180), rep(1, 20), rep(0, 40), rep(1, 10)),
                  labels = c("control", "cases"))
genotype <- factor(c(rep("A", 200), rep("B", 50)), levels = c("A", "B"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)), ] # shuffle them up
tab <- table(genotype, disease)
tab
disease
genotype control cases
A 180 20
B 40 10
The typical statistic we use to summarize these results is the odds ratio (OR). We compute the odds of having the disease if you are a "B": 10/40, the odds of having the disease if you are an "A": 20/180, and take the ratio: \((10/40) / (20/180)\)
Code
(tab[2,2]/tab[2,1]) / (tab[1,2]/tab[1,1])
[1] 2.25
To compute a \(p\)-value, we don't use the OR directly. We instead assume that there is no association between genotype and disease, and then compute what we expect to see in each cell of the table under the null hypothesis. Ignoring the groups, the probability of having the disease according to our data is:
Code
p <- mean(disease == "cases")
p
[1] 0.12
According to this probability, under our null hypothesis we expect to see a table like this:
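A sketch of how that expected table can be computed; the same values are returned by `chisq.test(tab)$expected`:

```r
# Expected counts under the null of no association:
# (row total * column total) / grand total
tab <- matrix(c(180, 40, 20, 10), nrow = 2,
              dimnames = list(genotype = c("A", "B"),
                              disease = c("control", "cases")))
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected
```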
The Chi-square test uses an asymptotic result (similar to the CLT) related to the sums of independent binary outcomes. Using this approximation, we can compute the probability of seeing a deviation from the expected table as big as the one we saw. The \(p\)-value for this table is:
Code
chisq.test(tab)$p.value
[1] 0.08857435
so we would expect to find a difference in disease rates this large by chance approximately 8.9% of the time, which is not enough to reject our null hypothesis.
Large Samples, Small \(p\)-values
Reporting only \(p\)-values is not an appropriate way to report the results of your experiment. Studies with large sample sizes will have impressively small \(p\)-values. Yet when one looks closely at the results, we may realize the odds ratios are quite modest: barely bigger than 1.
To demonstrate, we recalculate the \(p\)-value keeping all the proportions identical but increasing the sample size tenfold, which reduces the \(p\)-value substantially:
Code
tab <- tab * 10
chisq.test(tab)$p.value
[1] 0.000000001219624
Fitting Fallacies: Goodness of fit doesn’t guarantee predictive power. A model can fit past data perfectly yet fail miserably on new, unseen data, highlighting the dangers of overfitting.
The Sample Size Paradox: Doubling your sample size doesn’t necessarily halve the error. In fact, to do so, you’d need to quadruple the sample, given the square root relationship between sample size and margin of error.
Two-Sample Twists: When comparing two samples, it’s possible for each sample’s individual data to appear random, yet their combined data can reveal a distinct pattern or significant difference.
A goodness-of-fit test, also called the chi-squared or Pearson goodness-of-fit test, considers whether a categorical variable has a hypothesized distribution.
Warning! A goodness of fit test doesn’t give any information about which specific categories might be out of line with the hypothesized distribution. While you might be able to make an educated guess by looking at the data, this test shouldn’t be used to support that kind of intuition.
In 2008 the manufacturer of M&Ms published their last color distribution:

| Blue | Orange | Green | Yellow | Red | Brown |
|------|--------|-------|--------|-----|-------|
| 24%  | 20%    | 16%   | 14%    | 13% | 13%   |
We open several packages of M&Ms and count the colors:
| Blue | Orange | Green | Yellow | Red | Brown |
|------|--------|-------|--------|-----|-------|
| 85   | 79     | 56    | 64     | 58  | 68    |
Are these counts consistent with the last published percentages? Is there sufficient evidence to claim that the color distribution has changed? This question requires a test of goodness-of-fit for the six categories.
Our null hypothesis is that the color distribution is the same. The idea is to compare the observed counts to the numbers we would expect if \(H_0\) is true, so for our sample of 410 M&Ms we would expect:
| Blue | Orange | Green | Yellow | Red  | Brown |
|------|--------|-------|--------|------|-------|
| 98.4 | 82     | 65.6  | 57.4   | 53.3 | 53.3  |
We look at the difference between the observed and expected values, square that difference, and standardize it by dividing by the expected count.
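In symbols, \(\chi^2 = \sum \frac{(O - E)^2}{E}\). Computing the statistic for the M&M counts above:

```r
# Chi-square goodness-of-fit statistic for the M&M counts
observed <- c(85, 79, 56, 64, 58, 68)          # sampled colors, n = 410
p0 <- c(0.24, 0.20, 0.16, 0.14, 0.13, 0.13)    # published distribution
expected <- sum(observed) * p0
chi_sq <- sum((observed - expected)^2 / expected)
round(chi_sq, 2)   # 8.57
```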
Large values of the chi-square statistic \(\chi^2\) are evidence against the null hypothesis. To calculate the \(p\)-value we use the chi-square distribution. The \(p\)-value is the right tail of the \(\chi^2\) distribution with degrees of freedom = number of categories -1.
Code
df <- 5

# Create the chi-square distribution curve
curve(dchisq(x, df = df), from = 0, to = 40,
      main = paste("Chi-Square Distribution (df =", df, ")"),
      ylab = "Density", lwd = 2, col = "steelblue")

# Draw a vertical line at x = 8.57
abline(v = 8.57, col = "red", lwd = 2, lty = 2)

# Highlight the area to the right of x = 8.57
x_vector <- seq(8.57, 40, length.out = 1000)
y_vector <- dchisq(x_vector, df = df)
polygon(c(8.57, x_vector, 40), c(0, y_vector, 0),
        col = adjustcolor("red", alpha.f = 0.3), border = NA)
In our example, this \(p\)-value happens to be 12.7%, which does not allow us to reject the null hypothesis. In R we can calculate the \(p\)-value like this:
Code
# Set the chi-square value and degrees of freedom
x <- 8.57
df <- 5

# Calculate the p-value
p_value <- pchisq(x, df, lower.tail = FALSE)

# Print the p-value
p_value
[1] 0.1274943
Exercise:
In the exercise below we want to know if the age distribution in our project managers survey data matches the distribution of ages in the USA population. We get the US distribution from Wikipedia and make some adjustments so the ranges match those in our survey:
18-24 25-34 35-44 45-54 55-64 65 or over
4 108 77 27 4 1
Code
obs_ages <- as.numeric(table(pm_survey$how_old_are_you))
testResult <- chisq.test(obs_ages, p = us_ages)
testResult %>% report()
Effect sizes were labelled following Funder's (2019) recommendations.
The Chi-squared test for given probabilities / goodness of fit of obs_ages to a
distribution of [: n=25.857, : n=38.896, : n=37.128, : n=34.918, : n=36.686, :
n=47.515] suggests that the effect is statistically significant, and large
(chi2 = 260.52, p < .001; Fei = 0.40, 95% CI [0.35, 1.00])
The very small \(p\)-value indicates that it is extremely unlikely that our pm_survey data was drawn at random from a population with the us_ages distribution.
As mentioned, the goodness-of-fit test does not tell us which categories show the discrepancies. If we want to find out how our sample's age ranges differ from the USA data, we can simply subtract the corresponding proportions and see which categories are underrepresented and which are overrepresented.
Now we want to see if, in our optical sample, the customers are evenly distributed between the opticians. If we don't pass a \(p\) argument to the chi-squared test, it will assume we are asking for an even distribution across all categories:
Code
file2 <- here::here("data", "optical_full.xlsx")
optical_full <- read_excel(file2)
counts <- as.numeric(table(optical_full$`Optician Last Name`))
testResult <- chisq.test(counts)
testResult %>% report()
Effect sizes were labelled following Funder's (2019) recommendations.
The Chi-squared test for given probabilities / goodness of fit of counts to a
uniform distribution suggests that the effect is statistically not significant,
and tiny (chi2 = 46.44, p = 0.164; Fei = 0.01, 95% CI [0.00, 1.00])
Code
testResult$p.value
[1] 0.1636262
In this case we cannot reject the null hypothesis, which in this case is that the count of patients is evenly distributed among all opticians.
Exercise
Does the distribution of populations of towns in the United States follow Benford’s law? Check using the towns data set.
testResult <- chisq.test(digits_freq, p = benford)
testResult %>% report()
Effect sizes were labelled following Funder's (2019) recommendations.
The Chi-squared test for given probabilities / goodness of fit of digits_freq
to a distribution of [: n=3010, : n=1760, : n=1250, : n=970, : n=790, : n=670,
: n=580, : n=510, : n=460] suggests that the effect is statistically not
significant, and tiny (chi2 = 9.24, p = 0.323; Fei = 6.67e-03, 95% CI [0.00,
1.00])
Our \(p\)-value indicates that we cannot reject the null hypothesis, meaning in this case that the distribution of the first digit of town populations across the US is compatible with Benford's law.
8 Two-Sample testing
Two-sample testing allows for the comparison of two independent groups or populations to assess if there are statistically significant differences between their means, proportions, or other relevant measures.
Confidence intervals for two-sample comparison provide a range of plausible values for the difference in means or proportions, allowing for estimation with a desired level of confidence.
Significance testing for two-sample comparison involves evaluating the evidence against the null hypothesis and determining if the observed differences between groups are statistically significant.
In this simple exercise we are going to generate ratings for two products at random. Since we have not changed the probability of any rating, both products should have the same expected rating, so the difference of their mean ratings should be 0 if we have enough samples.
Creating many samples at random, calculating their differences, and plotting those differences shows that, although the bell curve is centered more or less at 0 as expected, many of the samples show a nonzero difference.
In reality we will only have access to one of the samples, and we have to be able to tell if it’s reasonable to draw a conclusion about the population just based on the sample or whether or not we should just attribute these sorts of differences to random chance.
8.1 Significance testing for Two-Samples data
8.1.1 z-test
The significance test for two samples uses the null hypothesis that there is no difference between the two population means. The \(p\)-value will be used to decide whether or not to reject that null hypothesis.
We can use a \(z\)-test for the difference between the two means:
Our expected difference is 0 because that's our null hypothesis (no difference). An important fact is that if \(\bar{x_1}\) and \(\bar{x_2}\) are independent, then: \(SE(\bar{x_2}-\bar{x_1}) = \sqrt{SE(\bar{x_1})^2 + SE(\bar{x_2})^2}\)
Last month, the president’s approval rating in a sample of 1000 likely voters was 55%. This month, a poll of 1,500 likely voters resulted in a rating of 58%. Is this sufficient evidence to conclude that the rating has changed?
\(\hat{p_1} = 55\%\) and \(\hat{p_2} = 58\%\)
\[
z = \frac{(\hat{p_2}-\hat{p_1})-0}{SE_{diff}}=\frac{(\hat{p_2}-\hat{p_1})-0}{\sqrt{ (SE(\bar{x_1}))^2 +(SE(\bar{x_2}))^2}}
\] The formula for the standard error of a proportion is \(SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\).
The calculated \(z\)-value is approximately 1.49. To determine if this is statistically significant, you would compare this \(z\)-value to the critical value from the standard normal distribution (typically 1.96 for a 95% confidence level). Since 1.49 is less than 1.96, you would fail to reject the null hypothesis at the 95% confidence level, indicating that there is not sufficient evidence to conclude that the president’s approval rating has changed significantly.
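As a sketch, the \(z\)-statistic can be reproduced in R from the two sample proportions (the proportions and sample sizes come from the example above):

```r
# Sample proportions and sizes from the approval-rating example
p1 <- 0.55; n1 <- 1000
p2 <- 0.58; n2 <- 1500

# Standard error of each sample proportion
se1 <- sqrt(p1 * (1 - p1) / n1)
se2 <- sqrt(p2 * (1 - p2) / n2)

# Standard error of the difference, then the z-statistic (null: no change)
se_diff <- sqrt(se1^2 + se2^2)
z <- (p2 - p1) / se_diff
round(z, 2)  # 1.48; the 1.49 in the text comes from rounding the SEs first
```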
The \(p\)-value can be calculated using standard normal distribution tables or software
Code
pValue <- pnorm(1.49)
pValue
[1] 0.9318879
Since this is a two-tailed test (we are checking if the approval rating has changed, not just increased or decreased), we need to consider both tails of the distribution, so we double our values: The \(p\)-value is calculated as \(2 \times (1 - \text{cumulative probability})\). \(p = 2 \times (1 - 0.9318) = 2 \times 0.0682 = 0.1364\)
The \(p\)-value for the \(z\)-value of 1.49 is approximately 0.1364. Since this \(p\)-value is greater than the typical significance level of 0.05, we fail to reject the null hypothesis. This means there is not sufficient evidence to conclude that the president’s approval rating has changed significantly.
The exercise above shows how to calculate the \(p\)-value for proportions. If we are working with means instead of proportions, the calculation is the same; the only thing to consider is that the standard error of each individual sample is then calculated using the formula for the standard error of the mean, \(SE(\bar{x_1})= \frac{\sigma_1}{\sqrt{n_1}} \approx \frac{s_1}{\sqrt{n_1}}\). If the sample sizes are not large, then the \(p\)-value needs to be computed from the t-distribution instead.
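A minimal sketch of the same test for means, using hypothetical summary statistics (the means, standard deviations, and sizes below are made up for illustration):

```r
# Hypothetical summary statistics for two large samples
m1 <- 50; s1 <- 10; n1 <- 100
m2 <- 47; s2 <- 12; n2 <- 120

# Standard error of the difference of the two means
se_diff <- sqrt(s1^2 / n1 + s2^2 / n2)

# z-statistic and two-tailed p-value
z <- (m1 - m2) / se_diff
p_value <- 2 * (1 - pnorm(abs(z)))
c(z = z, p = p_value)
```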
8.1.2 The Welch Two Sample t-test
If we have two sets, one control and one test, with means \(\mu_t\) and \(\mu_c\), the two-sample t-statistic is defined as:
\[
T = \frac{\hat\mu_t-\hat\mu_c}{s\sqrt{\frac{1}{n_t}+\frac{1}{n_c}}}
\tag{11}\]
Here \(s\) is an estimator of the pooled standard deviation of the two samples, \(s^2 = \frac{(n_t-1)s^2_t + (n_c-1)s^2_c}{n_t+n_c-2}\), where \(s^2_t\) and \(s^2_c\) are unbiased estimators of the variance of our metric in the treatment and control group, respectively. A large (absolute) value of \(T\) provides evidence against \(H_0\).
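A sketch checking the pooled-t formula against R's own pooled t-test, using toy data invented for illustration (variances assumed equal):

```r
# Toy control and treatment samples (hypothetical values)
control   <- c(5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 4.7)
treatment <- c(5.6, 5.9, 5.4, 6.0, 5.7, 5.5, 5.8, 6.1)

n_c <- length(control); n_t <- length(treatment)

# Pooled variance estimate from the two sample variances
s2 <- ((n_t - 1) * var(treatment) + (n_c - 1) * var(control)) /
      (n_t + n_c - 2)

# Two-sample t-statistic from the formula
T_manual <- (mean(treatment) - mean(control)) /
            sqrt(s2 * (1 / n_t + 1 / n_c))

# Same value reported by t.test with var.equal = TRUE
T_r <- unname(t.test(treatment, control, var.equal = TRUE)$statistic)
all.equal(T_manual, T_r)  # TRUE
```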
In the example below we use the attrition dataset to see whether the average age of employees in two different departments is different.
# A tibble: 2 × 2
Department avg_age
<chr> <dbl>
1 Research_Development 37.0
2 Sales 36.5
Code
testResult <- t.test(Age ~ Department, data = attrition1)
testResult$p.value
[1] 0.3366682
Code
report(testResult)
Effect sizes were labelled following Cohen's (1988) recommendations.
The Welch Two Sample t-test testing the difference of Age by Department (mean
in group Research_Development = 37.04, mean in group Sales = 36.54) suggests
that the effect is positive, statistically not significant, and very small
(difference = 0.50, 95% CI [-0.52, 1.52], t(880.05) = 0.96, p = 0.337; Cohen's
d = 0.06, 95% CI [-0.07, 0.20])
8.2 Confidence intervals for Two-Samples data.
We can use the formula for the standard error of the difference to construct a confidence interval as well. The confidence interval for \(p_2-p_1\) is \((\hat{p_2}-\hat{p_1}) \mp z \times SE(\hat{p_2}-\hat{p_1})\), where \(z\) is the \(z\)-score for the confidence level we are interested in.
In our example about the voters' approval of the president:
the \(z\)-score value for a confidence level of 95% is 1.959964
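Plugging the approval-rating numbers into the formula (a sketch; proportions and sample sizes are from the example):

```r
p1 <- 0.55; n1 <- 1000
p2 <- 0.58; n2 <- 1500

# Standard error of the difference in proportions
se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# 95% confidence interval for p2 - p1
z <- qnorm(0.975)  # 1.959964
ci <- (p2 - p1) + c(-1, 1) * z * se_diff
ci  # roughly (-0.0097, 0.0697); the interval contains 0
```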
If we solve the same exercise using R code, it gives similar but not exactly equal results:
Code
successes <- c(0.55 * 1000, 0.58 * 1500)

# Define the sample sizes
sample_sizes <- c(1000, 1500)

# Perform the two-sample z-test for proportions
test_result <- prop.test(successes, sample_sizes)

# Print the result
print(test_result)
2-sample test for equality of proportions with continuity correction
data: successes out of sample_sizes
X-squared = 2.0801, df = 1, p-value = 0.1492
alternative hypothesis: two.sided
95 percent confidence interval:
-0.07051474 0.01051474
sample estimates:
prop 1 prop 2
0.55 0.58
There are many statistical techniques for describing the difference in a single variable across two populations. The most universal is the Welch two-sample confidence interval, a technique for comparing means between two independent groups that takes into account the unequal variances often encountered in real-world data, so it is valid in nearly all circumstances. A few things to bear in mind:
It is a multi-sample procedure, not a multi-variable one. Use it when you’re asking how much two samples differ in a single variable.
Like every other statistical tool in this course, it requires that all observations be independent of one another (so it should not be used for time-series analysis)
Unless the data has extreme outliers or is highly asymmetric, it will give good results when both samples are of size 10 or more. If the samples are of size at least 30, it is fine under all but the most extreme circumstances.
The Welch two-sample confidence interval does not require that the samples are equal in size or that the populations have equal variances, unlike some other procedures.
We are going to use the AB_testing data we generated in the previous section and calculate the confidence interval for the differences observed in one of our samples:
testResult <- t.test(rating ~ product, data = AB_testing)
testResult %>% report()
Effect sizes were labelled following Cohen's (1988) recommendations.
The Welch Two Sample t-test testing the difference of rating by product (mean
in group A = 2.20, mean in group B = 2.83) suggests that the effect is
negative, statistically not significant, and small (difference = -0.63, 95% CI
[-1.88, 0.61], t(23.25) = -1.05, p = 0.305; Cohen's d = -0.41, 95% CI [-1.17,
0.37])
The confidence interval says that, with 95% confidence, the population mean difference is between -1.8804465 and 0.6137799. In particular, this means the difference in the means could be 0.
If you know or can assume that the variances of the two populations are equal, you can pass var.equal = TRUE to t.test; in that case, instead of a Welch two-sample t-test, R performs a (pooled) two-sample t-test:
Code
t.test(rating ~ product, data = AB_testing, var.equal = TRUE)
Two Sample t-test
data: rating by product
t = -1.055, df = 25, p-value = 0.3015
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-1.8697428 0.6030761
sample estimates:
mean in group A mean in group B
2.200000 2.833333
In the example below we use our substance_abuse data set to find out whether there is a difference in the DLA_improvement variable based on the program the patient was following. We are assuming that the variance is equal in both cases:
We can change the confidence level manually if we want:
Welch Two Sample t-test
data: DLA_improvement by Program
t = 38.23, df = 150.6, p-value < 0.00000000000000022
alternative hypothesis: true difference in means between group Intervention and group UsualCare is not equal to 0
99 percent confidence interval:
0.5539340 0.6350745
sample estimates:
mean in group Intervention mean in group UsualCare
0.591468531 -0.003035714
Exercises
Exercise 1: Generate a 95% confidence interval for the difference in average monthly income between the research and development and sales departments of the company
Effect sizes were labelled following Cohen's (1988) recommendations.
The Welch Two Sample t-test testing the difference of MonthlyIncome by
Department (mean in group Research_Development = 6281.25, mean in group Sales =
6959.17) suggests that the effect is negative, statistically significant, and
very small (difference = -677.92, 95% CI [-1166.04, -189.80], t(1030.99) =
-2.73, p = 0.007; Cohen's d = -0.17, 95% CI [-0.29, -0.05])
The confidence interval does not include 0, which means there is evidence of a difference between the population means of the two groups
Exercise 2: Is it reasonable to claim that the monthly rate is the same between these two departments?
Effect sizes were labelled following Cohen's (1988) recommendations.
The Welch Two Sample t-test testing the difference of MonthlyRate by Department
(mean in group Research_Development = 14284.87, mean in group Sales = 14489.79)
suggests that the effect is negative, statistically not significant, and very
small (difference = -204.93, 95% CI [-1007.25, 597.40], t(858.77) = -0.50, p =
0.616; Cohen's d = -0.03, 95% CI [-0.17, 0.10])
In light of the results we cannot reject the null hypothesis that they have the same monthly rate.
8.2.1 Pooled estimate:
Going back to our recent exercise on the voters' approval rating, we concluded that the two surveys did not differ significantly.
Since we could not reject the null hypothesis that the two samples represent the same approval for the candidate, we can combine them to find a better estimate of our standard error.
In the first sample we have 0.55 × 1000 = 550 voters who approved; in the second sample we have 0.58 × 1500 = 870, so in total we have 1420 approvals out of 2500 people surveyed. Our pooled estimate is therefore \(\frac{1420}{2500}=56.8\%\), and now we can calculate the standard error using that new value:
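In R, a sketch of the pooled calculation (all numbers come from the approval-rating example):

```r
n1 <- 1000; n2 <- 1500

# Total approvals across both surveys
approvals <- 0.55 * n1 + 0.58 * n2   # 550 + 870 = 1420

# Pooled estimate of the approval proportion
p_pooled <- approvals / (n1 + n2)    # 0.568

# Standard error of the difference using the pooled proportion
se_pooled <- sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

# z-statistic with the pooled standard error
z <- (0.58 - 0.55) / se_pooled
c(p_pooled = p_pooled, se = se_pooled, z = z)
```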
If we know (or there is reason to assume) that the standard deviation of the two populations is the same \(\sigma_1=\sigma_2\) then we can use the pooled estimate:
Two-Sample \(z\)-Test vs Two-Sample t-Test vs Welch's Two-Sample t-Test
Two-Sample \(z\)-Test
When to Use:
Known Population Variances: The population variances are known.
Large Sample Sizes: Typically used when the sample sizes are large (n > 30).
Normal Distribution: Assumes that the data follows a normal distribution.
Two-Sample t-Test
When to Use:
Unknown Population Variances: The population variances are unknown and assumed to be equal.
Small Sample Sizes: Typically used when the sample sizes are small (n < 30).
Normal Distribution: Assumes that the data follows a normal distribution.
Welch’s Two-Sample T-Test
When to Use:
Unknown and Unequal Population Variances: The population variances are unknown and not assumed to be equal.
Small or Unequal Sample Sizes: Can be used for small or unequal sample sizes.
Normal Distribution: Assumes that the data follows a normal distribution.
Summary:
Population Variances:
Z-Test: Assumes known population variances.
Welch’s T-Test: Does not assume equal variances and uses sample variances.
Sample Size:
Z-Test: Typically used for large sample sizes.
Welch’s T-Test: Can be used for small or unequal sample sizes.
8.3 Paired t-test
What do we do when we have two samples that are not independent of each other? In this case we cannot use the classical two-sample \(z\)-test or two-sample t-test.
We want to answer the question: Do husbands tend to be older than their wives?
Husband's age | Wife's age | Age difference
43 | 41 | 2
71 | 70 | 1
32 | 31 | 1
68 | 66 | 2
27 | 26 | 1
In a scenario like this, even if the samples were independent, the test would not give us a significant difference: the difference within each pair is always small, while the spread of values within each sample (the standard deviations) is large, and what the two-sample test does is compare the difference in means to the fluctuation within each population.
In this case we will use the paired t-test. Our \(H_0\) is that the population difference is 0. The formula for this test is:
\[
t=\frac{\bar{d}-0}{SE_{(d)}}
\]
where \(\bar{d}\) is the average of the differences, 0 is the expected difference under our null hypothesis, and \(SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}\) is the standard error of the differences, with \(s_d\) the standard deviation of the differences.
Now we have to use a table of the Student t-distribution with 4 degrees of freedom to find the area under the curve to the right of our t value.
We can use R code to calculate it like this:
Code
# Given values
t_value <- 5.69
degreesfreedom <- 4

# Calculate the p-value for a two-tailed test
p_value <- 2 * pt(-abs(t_value), degreesfreedom)
p_value
[1] 0.004711695
In this case our result means that we can reject the null hypothesis.
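The whole calculation can equivalently be run with t.test and paired = TRUE on the ages from the table (a sketch using those five pairs):

```r
husband <- c(43, 71, 32, 68, 27)
wife    <- c(41, 70, 31, 66, 26)

# Paired t-test: equivalent to a one-sample t-test on the differences
res <- t.test(husband, wife, paired = TRUE)
res$statistic  # about 5.72, slightly different from the text's 5.69 due to rounding
res$p.value    # about 0.0047
```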
8.4 The sign test
Imagine that we don't know the age differences; we only know whether each husband is older or not. We can follow the same approach as with a binomial distribution, like the coin toss. In this specific case our null hypothesis \(H_0\) is that half of the husbands are older than their wives (no difference). We assign labels to the results, for example 1 if the husband is older than the wife and 0 otherwise.
In our case we have 5 husbands older than their wives, so five 1s, and the standard deviation of each 0/1 observation under our null hypothesis is \(\sqrt{0.5 \times 0.5} = \frac{1}{2}\), since we expect half the husbands to be older than their wives.
Now we can find the \(p\)-value using a table or software:
Code
z_score <- 2.24

# Calculate the p-value for a two-tailed test
p_value <- 2 * (1 - pnorm(z_score))
p_value
[1] 0.02509092
We can see that the result of this test is less significant than the test we did before; this is because we have less information to work with.
If we want to solve the problem using software we can do the calculations:
Code
# Number of successes (husbands older than wives)
successes <- 5

# Total number of pairs
n <- 5

# Z-test (sign test approximation)
z_score <- (successes - 0.5 * n) / sqrt(0.25 * n)
z_p_value <- 2 * (1 - pnorm(z_score))
z_p_value
[1] 0.02534732
But if we instead use the exact binomial test that applies to this scenario, we get quite a different result:
Code
# Number of successes (husbands older than wives)
successes <- 5

# Total number of pairs
n <- 5

# Perform the binomial test
test_result <- binom.test(successes, n, p = 0.5, alternative = "two.sided")
print(test_result)
Exact binomial test
data: successes and n
number of successes = 5, number of trials = 5, p-value = 0.0625
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.4781762 1.0000000
sample estimates:
probability of success
1
Both the two-sample paired t-test and the sign test are used to compare paired data, but they are applied in different situations based on the assumptions and characteristics of the data.
Two-Sample Paired T-Test
When to Use:
Normal Distribution: The differences between the paired observations should be approximately normally distributed.
Interval or Ratio Data: The data should be measured on an interval or ratio scale.
Parametric Test: This test is parametric, meaning it relies on assumptions about the distribution of the data.
Example:
Comparing the blood pressure of patients before and after a treatment.
Measuring the weight of individuals before and after a diet program.
Sign test
Paired Observations: When you have paired data (e.g., before and after measurements) and you want to test if there is a consistent difference between the pairs. For example, testing if a treatment has an effect by comparing measurements before and after the treatment.
Median Differences: When you want to test if the median of differences between pairs is zero. This is useful when the data does not meet the assumptions required for parametric tests like the paired t-test. It does not rely on assumptions about the distribution of the data.
Non-Normal Data: When the data does not follow a normal distribution, making parametric tests inappropriate. The sign test does not assume any specific distribution for the data.
Ordinal Data: When the data is ordinal (ranked) rather than interval or ratio. The sign test can handle data that can only be compared as greater than, less than, or equal to.
Example Scenarios
Medical Studies: Comparing the effectiveness of a treatment by measuring patient conditions before and after the treatment.
Quality Control: Testing if a new manufacturing process consistently produces better results than the old process.
Behavioral Studies: Comparing responses before and after an intervention. Comparing the number of days patients feel better before and after a new medication.
Summary:
Use a paired t-test when the differences between paired observations are normally distributed and you have interval or ratio data.
Use a sign test when the data does not meet the normality assumption, is ordinal, or you prefer a non-parametric approach.
8.5 Wilcoxon Rank Sum Test
We learned how the sample mean and SD are susceptible to outliers. The t-test is based on these measures and is susceptible as well. The Wilcoxon rank-sum test (equivalent to the Mann-Whitney test) provides an alternative. In the code below, we perform a t-test on data for which the null is true. However, we change one observation by mistake in each sample, and the values incorrectly entered are different. Here we see that the t-test results in a small \(p\)-value, while the Wilcoxon test does not:
Code
set.seed(779) ## 779 picked for illustration purposes
N <- 25
x <- rnorm(N, 0, 1)
y <- rnorm(N, 0, 1)
The basic idea is to 1) combine all the data, 2) turn the values into ranks 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test.
Data from two populations with two outliers. The left plot shows the original data and the right plot shows their ranks. The numbers are the w values
Code
W <- sum(ws)
W is the sum of the ranks for the first group relative to the second group. We can compute an exact \(p\)-value for \(W\) based on combinatorics. We can also use the CLT since statistical theory tells us that this W is approximated by the normal distribution. We can construct a \(z\)-score as follows:
Here the Z is not large enough to give us a \(p\)-value less than 0.05. These are part of the calculations performed by the R function wilcox.test.
We are not going to get into the mathematical detail of these calculations, but the formula for the \(z\)-score here, written in terms of the average rank \(\bar{W}\) of the first group, is \[
z=\frac{\bar{W} - \frac{n_1+n_2+1}{2}}{\sqrt{\frac{n_2(n_1+n_2+1)}{12n_1}}}
\] and to perform this test in R we use: wilcox.test(x, y)
8.6 Two-samples of binary data
Two observed proportions for a single binary variable can be compared directly using the two-proportion \(z\)-confidence interval and two-proportion \(z\)-test. These apply when there are at least 5 observations of each type in each of the two groups. The formulas are relatively simple. For example, a 95% confidence interval for the difference between proportions is \[(\hat p_2-\hat p_1) \mp 1.96\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1}+\frac{\hat p_2(1-\hat p_2)}{n_2}}\]
where \(p_1\) and \(p_2\) are the population proportions, \(\hat p_1\) and \(\hat p_2\) are the observed sample proportions, and \(n_1\) and \(n_2\) are the sample sizes. R will use an improved version of this formula when computing the proportions.
In R we will use \(prop.test(n_s , sample size)\) where \(n_s\) is the number of successes.
Examples:
In the attrition dataset: are the attrition proportions different for the two departments?
2-sample test for equality of proportions with continuity correction
data: yes_counts out of sampleSize
X-squared = 9.9491, df = 1, p-value = 0.001609
alternative hypothesis: two.sided
95 percent confidence interval:
-0.11295996 -0.02280109
sample estimates:
prop 1 prop 2
0.1383975 0.2062780
Our \(p\)-value, 0.0016093, is very low, which means we can reject the default null hypothesis (that the difference between the two proportions is 0). The output also gives the sample proportions for the two departments, 0.1383975 and 0.206278, and the confidence interval for the difference of the two proportions (as prop1 - prop2). Being negative, it means the second department has a higher attrition rate. The order of the groups is as entered in the vectors, so if yes_counts had Research_Development first and Sales second, prop1 refers to Research_Development and prop2 to Sales.
8.7 Two-samples of categorical variables
Two samples of a single categorical variable can be compared with the \(\chi^2\) test for homogeneity (chi-squared). Under the hood, this is just a goodness-of-fit test of the hypothesis that the observed proportions in one of the samples are the same as those in the pooled sample.
8.7.1 Testing homogeneity
The \(\chi^2\) test of homogeneity tests the null hypothesis that the distribution of a categorical variable is the same for several populations. It assumes that the samples are drawn independently within and across populations.
See how we can apply this logic to our Titanic survival data:
 | First | Second | Third | Crew
Survived | 202 | 118 | 178 | 215
Died | 123 | 167 | 528 | 698
Note that in this case we are not sampling from a population. The data are not a random sample of the people on board; rather, they represent the whole population. So here the chance process resulting in survival or death is not the sampling, but the result of random events occurring when looking for a way off the ship, like getting into a lifeboat or into the water, being rescued from the water in time, etc. The 325 observations of first-class passengers then represent 325 independent draws from a probability histogram that gives a certain chance of survival. The 285 observations of second-class passengers are drawn from the probability histogram for second-class passengers, which may be different. The null hypothesis says that the probability of survival is the same for all four probability histograms. Under this hypothesis we can estimate the probability of survival by pooling all the data: \(\frac{713}{2229}\approx 32\%\). With this number we can calculate the expected number of surviving passengers for each class:
Surviving | First | Second | Third | Crew
Observed | 202 | 118 | 178 | 215
Expected | 104.0 | 91.2 | 225.8 | 292.1
Died | First | Second | Third | Crew
Observed | 123 | 167 | 528 | 698
Expected | 221.0 | 193.8 | 480.1 | 620.8
Now we can compute our chi-square statistic as we learned, using all the differences between the expected and observed values:
In this case the degrees of freedom are calculated as (4-1) × (2-1) = 3, where 4 is the number of classes (columns) and 2 is the number of outcomes we are dealing with (survived and died)
In our case the \(p\)-value will be extremely small, suggesting we should reject the null hypothesis that all ticket classes had the same probability of survival.
Code
x <- 192.2
df <- 3

# Calculate the p-value
p_value <- pchisq(x, df, lower.tail = FALSE)

# Print the p-value
p_value
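The expected counts and the statistic above can all be reproduced at once by running chisq.test on the contingency table (a sketch, built from the counts in the text):

```r
# Titanic survival counts by class (rows: Survived, Died)
titanic <- matrix(c(202, 118, 178, 215,
                    123, 167, 528, 698),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Survived", "Died"),
                                  c("First", "Second", "Third", "Crew")))

ct <- chisq.test(titanic)
round(ct$expected, 1)  # matches the expected counts computed by hand
ct$statistic           # about 192.2
ct$p.value             # essentially 0
```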
Use the substance_abuse data set to decide if the distribution of mental health diagnosis is the same for those with and without at least one psychiatric admission.
Pearson's Chi-squared test
data: t
X-squared = 21.394, df = 3, p-value = 0.00008719
In this case, according to our \(p\)-value, we cannot say that patients who had at least one psychiatric admission in the past have the same distribution of diagnoses as patients without any admission.
The chi-squared test for homogeneity does not say anything specific about any of these diagnoses; it simply says that the distribution of diagnoses between the two groups (admitted vs not admitted) is not the same.
Is the distribution of SUDx the same for both men and women in our Substance Abuse data?
Code
head(substance_abuse)
# A tibble: 6 × 14
`Admission Date` PPID Program Age Gender RaceEthnicity MHDx SUDx MedDx
<dttm> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2022-01-13 00:00:00 A234… Interv… 34 F Other Depr… Alco… 2
2 2022-02-18 00:00:00 A232… Interv… 26 M NonHispWhite Trau… Opio… 0
3 2022-01-28 00:00:00 A259… Interv… 62 M NativeAm Depr… Opio… 0
4 2022-01-30 00:00:00 A353… Interv… 34 F NonHispWhite Depr… Alco… 0
5 2022-03-28 00:00:00 A302… UsualC… 46 M NonHispBlack Trau… Opio… 0
6 2022-02-17 00:00:00 A315… Interv… 51 M NonHispWhite Anxi… Opio… 1
# ℹ 5 more variables: PsychAdmit <dbl>, DLA1 <dbl>, DLA2 <dbl>,
# DLA_improvement <dbl>, previousAdmit <lgl>
t <- table(substance_abuse$Gender, substance_abuse$SUDx)
head(t)
Alcohol None Opioid Stimulant
F 68 36 58 17
M 99 66 104 31
Our null hypothesis is that the distribution between gender is the same.
Code
result <- chisq.test(t)
report(result)
Effect sizes were labelled following Funder's (2019) recommendations.
The Pearson's Chi-squared test of independence between and suggests that the
effect is statistically not significant, and tiny (chi2 = 1.24, p = 0.744;
Adjusted Cramer's v = 0.00, 95% CI [0.00, 1.00])
In this case the \(p\)-value of the test is over 0.05, which indicates that there is no significant difference by sex for the SUDx variable, so men and women are abusing the different categories of substances in the same proportions. The df in the results are the degrees of freedom; for a table like this one they equal (rows − 1) × (columns − 1).
In reality this is the same as doing a goodness-of-fit test as we saw before: remember it was chisq.test(p_s, p), where p_s was the counts of the different categories in our sample and p the proportions of the population we wanted to test against. Back to our example, what we are measuring here is the proportion for women (or men) against the proportion in the full sample; that tells us whether there is a difference between men and women, so we can also write the test like this, getting similar results:
Effect sizes were labelled following Funder's (2019) recommendations.
The Chi-squared test for given probabilities / goodness of fit of women to a
distribution of [Alcohol: n=62.4070981210856, None: n=38.1169102296451, Opioid:
n=60.5386221294363, Stimulant: n=17.937369519833] suggests that the effect is
statistically not significant, and tiny (chi2 = 0.77, p = 0.856; Fei = 0.02,
95% CI [0.00, 1.00])
8.7.2 Independence Testing of Categorical Variables
We want to answer this question: Is gender (M/F) related to voting preference (liberal/conservative)? Now we have two categorical variables: gender and voting preference. The null hypothesis is that the two variables are independent. The alternative hypothesis is that there is some kind of association.
The tool we are going to use is the chi-square ( \(\chi^2\) ) test for independence. This test can be used to check whether two categorical variables are independent or not, that is, whether knowledge about one provides information about the other. The math is exactly the same as for the chi-square test for homogeneity, and hence so is the syntax in most statistical packages, including R, but there are a few things to bear in mind:
Technically, the null hypothesis of a \(\chi^2\) test for independence is that in the larger population, the probability of an observation falling in any specific pair of categories is equal to the product of the probabilities of falling in each of the individual categories.
This is another omnibus test: it says nothing about particular categories.
Larger cell counts are better, as usual. Do not use this test if any of the expected counts is less than 5.
We are going to use our product_rating dataset to see the relationship between age groups and the product the user purchased.
First, we can look at the contingency table for those two variables:
A B C
<20 109 136 346
20-34 104 155 272
35-49 145 149 301
50-64 95 86 227
65+ 75 47 118
Code
chisq.test(t)
Pearson's Chi-squared test
data: t
X-squared = 30.81, df = 8, p-value = 0.0001519
The low \(p\)-value from this test indicates that the hypothesis that these two categorical variables are independent is implausible.
Chi-Squared is used for testing the independence of categorical variables. It compares observed frequencies in a contingency table to expected frequencies under independence.
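Those expected frequencies can be computed by hand from the row and column totals; a sketch using the age-group table shown above:

```r
# contingency table of age group vs product purchased (values from above)
t <- as.table(rbind(
  "<20"   = c(A = 109, B = 136, C = 346),
  "20-34" = c(A = 104, B = 155, C = 272),
  "35-49" = c(A = 145, B = 149, C = 301),
  "50-64" = c(A =  95, B =  86, C = 227),
  "65+"   = c(A =  75, B =  47, C = 118)
))

# expected counts under independence: E_ij = (row total * column total) / n
expected <- outer(rowSums(t), colSums(t)) / sum(t)

# chisq.test computes the same expected counts internally
all.equal(unname(chisq.test(t)$expected), unname(expected))
```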
Exercises
Referring to the substance_abuse data set:
Is there any evidence of an association between Substance Use Diagnosis and Mental Health Diagnosis among patients receiving an intervention?
Code
treatment <- filter(substance_abuse, Program == "Intervention")
t <- table(treatment$SUDx, treatment$MHDx)
chisq.test(t)
Pearson's Chi-squared test
data: t
X-squared = 7.6914, df = 9, p-value = 0.5655
According to this result we cannot conclude that there is a relationship between these two variables.
Is there any evidence of an association between SUDx and DLA_improvement among patients receiving an intervention?
Code
ggplot(treatment, aes(x = SUDx, y = DLA_improvement)) + geom_boxplot()
Code
testResult <- aov(DLA_improvement ~ SUDx, data = treatment)
summary(testResult)
Df Sum Sq Mean Sq F value Pr(>F)
SUDx 3 0.288 0.09606 2.981 0.0336 *
Residuals 139 4.479 0.03222
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference is significant, but not by much: the \(p\)-value shows that in about 3% of samples drawn from a population where the category means are equal, we would observe differences at least this large.
If we run a TukeyHSD test on these results we can compare each pair individually, and we find that the only pair showing a statistically significant difference is alcohol versus opioids.
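A minimal self-contained sketch of that pairwise comparison (the `treatment` data frame is not reproduced here, so the group names and effect sizes below are simulated and purely illustrative):

```r
set.seed(1)
d <- data.frame(
  group = rep(c("Alcohol", "None", "Opioid"), each = 30),
  score = c(rnorm(30, mean = 0.30, sd = 0.18),
            rnorm(30, mean = 0.25, sd = 0.18),
            rnorm(30, mean = 0.20, sd = 0.18))
)
fit <- aov(score ~ group, data = d)
TukeyHSD(fit)  # one adjusted p-value and confidence interval per pair
```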
It is easy to confuse the testing for homogeneity and the testing for independence.
8.7.3 Comparing the different uses of the chi-square test

| test | sample | research question |
|---|---|---|
| goodness of fit | single categorical variable, one sample | Do the counts of the different categories match our expected results? |
| homogeneity | single categorical variable measured on several samples | Are the groups homogeneous (do they have the same distribution of the categorical variable)? |
| independence | two categorical variables measured on one sample | Are the two categorical variables independent? |
Examples:
We want to know whether there are more births on some days of the week (Monday, Tuesday, …). We record the day of the week for 300 births. What test should we use to find out whether births differ by day? –> We would use the chi-square test for goodness of fit.
A food delivery start-up decides to advertise its service by placing ads on web pages. They wonder whether the percentage of viewers who click on the ad changes depending on how often the viewers were shown the ad. They randomly select 100 viewers from among those who were shown the ad once, 135 from among those who were shown the ad twice, and 150 from among those who were shown the ad three times. –> We would use the chi-square test of homogeneity.
An airline wants to find out whether there is a connection between the customer’s status in its frequent flyer program and the class of ticket that the customer buys. It samples 1,000 ticket records at random and for each ticket notes the status level (‘none’, ‘silver’, ‘gold’) and the ticket class (‘economy’, ‘business’,‘first’). –> chi-square test of independence
A county wants to check whether the racial composition of the teachers in the county corresponds to that of the population in the county. It samples 500 teachers at random and wants to compare that sample with the census numbers about the racial groups in that county. –> chi-square test for goodness of fit
9 Permutation Tests
Suppose we have a situation in which none of the standard mathematical statistical approximations apply. We have computed a summary statistic, such as the difference in mean, but do not have a useful approximation, such as that provided by the CLT. In practice, we do not have access to all values in the population so we can’t perform a simulation. Permutation tests can be useful in these scenarios.
We are back to the scenario where we only have 12 measurements for each group.
In previous sections, we showed parametric approaches that helped determine if the observed difference was significant. Permutation tests take advantage of the fact that if we randomly shuffle the cases and control labels, then the null is true. So we shuffle the cases and control labels and assume that the ensuing distribution approximates the null distribution. Here is how we generate a null distribution by shuffling the data 1,000 times:
Code
N <- 12
avgdiff <- replicate(1000, {
  all <- sample(c(control, treatment))
  newcontrols <- all[1:N]
  newtreatments <- all[(N + 1):(2 * N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v = obsdiff, col = "red", lwd = 2)
How many of the null means are bigger than the observed value? That proportion is the \(p\)-value under the null. We add 1 to both the numerator and the denominator to avoid reporting a \(p\)-value of exactly zero and to account for estimation error. The \(p\)-value here is calculated directly as the proportion of permuted differences larger in magnitude than our observed value.
Code
# the proportion of permutations with a larger absolute difference
(sum(abs(avgdiff) > abs(obsdiff)) + 1) / (length(avgdiff) + 1)
[1] 0.05294705
Now let’s repeat this experiment for a smaller dataset. We create a smaller dataset by sampling:
Code
N <- 5
control <- sample(control, N)
treatment <- sample(treatment, N)
obsdiff <- mean(treatment) - mean(control)
obsdiff
Now the observed difference is not significant using this approach. Keep in mind that there is no theoretical guarantee that the null distribution estimated from permutations approximates the actual null distribution. For example, if there is a real difference between the populations, some of the permutations will be unbalanced and will contain some samples that explain this difference. This implies that the null distribution created with permutations will have larger tails than the actual null distribution, which is why permutations result in conservative \(p\)-values. For this reason, when we have very few samples, permutations are not a reliable option.
Note also that permutations tests still have assumptions: samples are assumed to be independent and “exchangeable”. If there is hidden structure in your data, then permutation tests can result in estimated null distributions that underestimate the size of tails because the permutations may destroy the existing structure in the original data.
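Putting the whole procedure together in a self-contained sketch (the original `control` and `treatment` vectors are defined earlier in the book, so the values below are simulated for illustration):

```r
set.seed(123)
control   <- rnorm(12, mean = 24, sd = 3)
treatment <- rnorm(12, mean = 27, sd = 3)
obsdiff <- mean(treatment) - mean(control)

N <- length(control)
avgdiff <- replicate(1000, {
  shuffled <- sample(c(control, treatment))  # break the group labels
  mean(shuffled[(N + 1):(2 * N)]) - mean(shuffled[1:N])
})

# two-sided permutation p-value with the +1 correction
pval <- (sum(abs(avgdiff) >= abs(obsdiff)) + 1) / (length(avgdiff) + 1)
pval
```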
10 Multiple test corrections
The Bonferroni correction is a statistical method used to address the problem of multiple comparisons. When you perform multiple hypothesis tests, the chance of making a Type I error (false positive) increases. The Bonferroni correction helps control this by adjusting the significance level.
What we do is, if there are \(m\) tests, multiply each \(p\)-value by \(m\) (capping the result at 1). This correction ensures that P(any of the \(m\) tests rejects in error) stays below 5%. The Bonferroni correction is often very restrictive: it guards against having even one false positive among the \(m\) tests. As a consequence, the adjusted \(p\)-values may no longer be significant even when a noticeable effect is present, so it only works well when the number of tests is small.
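As a sketch, the manual multiplication matches what `p.adjust` does (the example \(p\)-values are made up):

```r
p_values <- c(0.001, 0.010, 0.020, 0.040)
m <- length(p_values)

manual   <- pmin(p_values * m, 1)                      # multiply by m, cap at 1
adjusted <- p.adjust(p_values, method = "bonferroni")

all.equal(manual, adjusted)
# only the first two p-values survive at the 5% level
adjusted < 0.05
```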
Alternatively, if the number of tests is large, it is better to control the False Discovery Proportion (FDP):

$$\text{FDP} = \frac{\text{number of false discoveries}}{\text{total number of discoveries}}$$

where a discovery occurs when a test rejects the null hypothesis. If no hypotheses are rejected, the FDP is defined to be 0.
The False Discovery Rate (FDR) is the expected proportion of false discoveries among the discoveries. One common method to control the FDR is the Benjamini-Hochberg procedure, which adjusts the \(p\)-values to control the FDR at a desired level \(\alpha\). The procedure is as follows:
Sort the \(p\)-values in ascending order. \(p_{(1)}, p_{(2)}, \ldots, p_{(m)}\)
Find the largest \(k\) such that \(p_{(k)} \leq \frac{k}{m}\alpha\)
Reject the null hypothesis for all \(p_{(i)}\) with \(i \leq k\)
In R:
Code
# Example p-values
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)

# Apply the Benjamini-Hochberg procedure
p.adjusted <- p.adjust(p_values, method = "BH")
print(p.adjusted)
[1] 0.030 0.060 0.060 0.012 0.060 0.200
This will give you the adjusted \(p\)-values, and you can compare them to your desired FDR level to decide which hypotheses to reject.
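For example, comparing the adjusted \(p\)-values to an FDR level of 0.05, reusing the same example values:

```r
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)
p_adj <- p.adjust(p_values, method = "BH")

which(p_adj < 0.05)  # hypotheses rejected at FDR = 0.05
```

Here only the first and fourth hypotheses (raw \(p\)-values 0.01 and 0.002) are rejected.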
The validation set approach. With this approach you split your data set into two parts before the analysis: a model-building set and a validation set. You may use data snooping (multiple testing) on the first set to find something interesting, and then test that hypothesis on the validation set.
Statistical tests make some common assumptions about the data they are testing:
Independence of observations (a.k.a. no autocorrelation): The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).
Homogeneity of variance: the variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test’s effectiveness.
Normality of data: the data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data.
If your data do not meet the assumptions of normality or homogeneity of variance, you may be able to perform a nonparametric statistical test, which allows you to make comparisons with far fewer assumptions about the data distribution.
If your data do not meet the assumption of independence of observations, you may be able to use a test that accounts for structure in your data (repeated-measures tests or tests that include blocking variables).
11.0.2 Types of variables
The types of variables you have usually determine what type of statistical test you can use.
Quantitative variables represent amounts of things (e.g. the number of trees in a forest). Types of quantitative variables include:
Continuous (aka ratio variables): represent measures and can usually be divided into units smaller than one (e.g. 0.75 grams).
Discrete (aka integer variables): represent counts and usually can’t be divided into units smaller than one (e.g. 1 tree).
Categorical variables represent groupings of things (e.g. the different tree species in a forest). Types of categorical variables include:
Ordinal: represent data with an order (e.g. rankings).
Nominal: represent group names (e.g. brands or species names).
Binary: represent data with a yes/no or 1/0 outcome (e.g. win or lose).
Choose the test that fits the types of predictor and outcome variables you have collected (if you are doing an experiment, these are the independent and dependent variables). Consult the tables below to see which test best matches your variables.
11.1 Choosing a parametric test: regression, comparison, or correlation
Parametric tests usually have stricter requirements than nonparametric tests, and are able to make stronger inferences from the data. They can only be conducted with data that adheres to the common assumptions of statistical tests.
The most common types of parametric test include regression tests, comparison tests, and correlation tests.
11.1.1 Regression tests
Regression tests look for cause-and-effect relationships. They can be used to estimate the effect of one or more continuous variables on another variable.
11.1.2 Comparison tests
Comparison tests look for differences among group means. They can be used to test the effect of a categorical variable on the mean value of some other characteristic.
T-tests are used when comparing the means of precisely two groups (e.g., the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g., the average heights of children, teenagers, and adults).
| Test | Predictor variable | Outcome variable | Example |
|---|---|---|---|
| Paired t-test | Categorical, 1 predictor | Quantitative, groups come from the same population | What is the effect of two different test prep programs on the average exam scores for students from the same class? (Predictor: test program) |
| Independent t-test | Categorical, 1 predictor | Quantitative, groups come from different populations | What is the difference in average exam scores for students from two different schools? (Predictor: school A/B) |
| ANOVA | Categorical, 1 or more predictors | Quantitative, 1 outcome | What is the difference in average pain levels among post-surgical patients given three different treatments? |
| MANOVA | Categorical, 1 or more predictors | Quantitative, 2 or more outcomes | What is the effect of flower species on petal length, petal width and stem length? |
| Pearson's r | 2 continuous variables | 2 continuous variables | How are latitude and temperature related? |
11.2 Flowchart for parametric tests
11.2.1 Parametric Tests
Parametric tests are statistical tests that make certain assumptions about the parameters of the population distribution from which the sample is drawn. These assumptions typically include:
Normality: The data follows a normal distribution.
Homogeneity of variance: The variances within each group being compared are similar.
Independence: The observations are independent of each other.
Because of these assumptions, parametric tests are generally more powerful and can make stronger inferences about the population. Some common parametric tests include:
T-tests: Used to compare the means of two groups.
ANOVA (Analysis of Variance): Used to compare the means of three or more groups.
Regression Analysis: Used to examine the relationship between one or more predictor variables and an outcome variable.
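The three families above can be sketched on simulated data (the variable names and effect sizes below are made up for illustration):

```r
set.seed(42)
x <- rnorm(50, mean = 10)
y <- rnorm(50, mean = 11)
z <- rnorm(50, mean = 12)

t.test(x, y)                              # compare two group means

d <- data.frame(value = c(x, y, z),
                group = rep(c("x", "y", "z"), each = 50))
summary(aov(value ~ group, data = d))     # compare three or more group means

summary(lm(y ~ x))                        # regression of one variable on another
```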
11.2.2 Non-Parametric Tests
Non-parametric tests, on the other hand, do not make strict assumptions about the population parameters. They are more flexible and can be used when the assumptions of parametric tests are not met, such as when the data does not follow a normal distribution or when sample sizes are small. Non-parametric tests are also useful for ordinal data or data with outliers.
Some common non-parametric tests include:
Mann-Whitney U Test: Used to compare the medians of two independent groups.
Wilcoxon Signed-Rank Test: Used to compare the medians of two related groups.
Kruskal-Wallis Test: Used to compare the medians of three or more independent groups.
Spearman’s Rank Correlation: Used to assess the relationship between two ordinal variables.
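A sketch of the non-parametric counterparts on simulated skewed data (groups and parameters are made up; the paired call is shown only to illustrate the syntax):

```r
set.seed(7)
a  <- rexp(40, rate = 1)      # skewed group
b  <- rexp(40, rate = 0.5)    # skewed group with larger typical values
c3 <- rexp(40, rate = 2)

wilcox.test(a, b)                 # Mann-Whitney U (rank-sum), two groups
wilcox.test(a, b, paired = TRUE)  # Wilcoxon signed-rank, for paired groups
kruskal.test(list(a, b, c3))      # three or more independent groups
cor(a, b, method = "spearman")    # Spearman's rank correlation
```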
11.2.3 Key Differences
Assumptions: Parametric tests assume specific population parameters, while non-parametric tests do not.
Data Type: Parametric tests are typically used for interval or ratio data, whereas non-parametric tests can be used for ordinal data or data that does not meet parametric assumptions.
Power: Parametric tests are generally more powerful when their assumptions are met, meaning they are more likely to detect a true effect. Non-parametric tests are more robust and flexible but may be less powerful.
11.2.4 Example Scenario
Imagine you want to compare the test scores of students from two different schools. If the test scores are normally distributed and the variances are similar, you would use a parametric test like the independent t-test. However, if the test scores are not normally distributed or have outliers, you might use a non-parametric test like the Mann-Whitney U Test.
11.3 Choosing a non-parametric test
| Non-parametric test | Predictor variable | Outcome variable | Use in place of… |
|---|---|---|---|
| Spearman’s r | Quantitative | Quantitative | Pearson’s r |
| Chi square test of independence | Categorical | Categorical | Pearson’s r |
| Sign test | Categorical | Quantitative | One-sample t-test |
| Kruskal–Wallis H | Categorical, 3 or more groups | Quantitative | ANOVA |
| ANOSIM | Categorical, 3 or more groups | Quantitative, 2 or more outcome variables | MANOVA |
| Wilcoxon Rank-Sum test | Categorical, 2 groups | Quantitative, groups come from different populations | Independent t-test |
| Wilcoxon Signed-rank test | Categorical, 2 groups | Quantitative, groups come from the same population | Paired t-test |
Source Code
---title: "Inference"format: htmleditor: visual---```{r}#| echo: falselibrary(dplyr)library(tidyverse)library(here)library(readxl)library(easystats)library(infer)library(kableExtra)library(plotly)library(ggplot2)library(patchwork)library(DiagrammeR)theme_set(theme_minimal())options(scipen=999)```# Introduction to Inference```{r, echo=FALSE}#| echo: falsefile <- here::here("data", "optical_full.xlsx")optical <-read_excel(file)```Drawing conclusions from data is difficult because we almost never have complete information about any variable of interest. For instance, we may have only been able to send a satisfaction survey to a small subset of our customers.- The **population** of a study consists of all possible observations of interest.- A **sample** in a study consists of all the observations we actually have.We are almost always interested in a population, but only have information about a sample. Statistical inference is the process of reasoning from the one to the other.- A **parameter** is a number that describes a population- A **statistic** is a number that describes a sampleThe mean of the eye difference variable in our optical dataset is `r mean(optical$eye_difference)`. 
The **sample mean** is found by adding together all the values of a variable in the sample data, and dividing this total by $n$, the sample size.If we extract 100 samples with three observations each from our data and visually show their means we can see every time we got a different value, sometimes they will be very close to the target parameter, but sometimes they can be quite far.```{r, fig.align='center', echo=FALSE}#| eval: truekable(head(optical))%>%kable_styling(latex_options ="scale_down")%>%landscape()as.data.frame(report(optical$eye_difference))sample <-slice_sample(optical, n =3)mean(sample$eye_difference)samples <-rep_slice_sample(optical, n =3, reps =100)sample_summary <- samples |>group_by(replicate) |> dplyr::summarize(sample_mean =mean(eye_difference))ggplot(sample_summary, aes(x = sample_mean)) +geom_histogram() +geom_vline(xintercept =mean(optical$eye_difference),linetype ="dotted")```**Variability** is the natural tendency for a statistic to be different across different samples.A **Biased statistics**, on the other hand, misses the mark on average, for example if we increase the value of all our sample means by 1:```{r, fig.align='center', echo=FALSE}ggplot(sample_summary, aes(x = sample_mean+1)) +geom_histogram() +geom_vline(xintercept =mean(optical$eye_difference),linetype ="dotted")```::: {.callout-orange appearance="simple" icon="false"}**Reproducibility and Replicability**Replicating means to get the same conclusion with slightly different samples, procedures, and data analysis methods.Related to that is the problem of reproducibility. That means to get the same results then you simply use the same data and same methods that were claimed in the analysis.:::::: {.callout-orange appearance="simple" icon="false"}::: centered-text- Shortcomings of statistical analysis \*:::Statistical inference can handle variability, but not bias. 
Bias comes from incorrect data collection and only good data collection and management practices can prevent bias.Statistical inference can lead to wildly wrong conclusions when misused or abused for example:- Sample bias: not choosing the sample randomly.- Undercoverage: your procedure miss out keep portions of the population.- Overgeneralization: you need to limit the conclusions that you draw to the population that you have sampled from.Significance test are particulary easy to abuse:- If you run a test just because you've spotted a pattern in your data, you're almost certain to get a positive result even if the trend is just due to random chance. Patterns exist just by chance.- p-hacking: repeated testing of the same hypothesis will eventually yield a positive result just by random chance.- Confusing significance for importance. Just because a difference is significant does not mean it is important.No amount of statistical savvy can fully remediate bad data. All of the techniques we'll discuss in this course are designed to address variability in data, not bias. Before attempting any sort of analysis, you should always take some time to think critically about the overall quality of the data. A few questions to consider:- Is the sample representative of the population of interest, or might some subgroups be under or over represented?- How was the data collected? Are any of the variables self-reported? Were there points in the process where those collecting the data made subjective judgments?- How will conclusions from the data analysis be used? How might they be misinterpreted or abused?:::# Expected Value and Standard Error:If we are interested in the average height of adult males in USA and we extract an adult male at random, we expect his height to be around the population average, give or take about one standard deviation $\sigma$ . **The expected value is the population average** $\mu$ . 
The expected value of the sample average is still the population average, but remember that the population mean is a random variable because sampling is a random process, so its mean won't be exactly the population mean. How far off can we expect this sample mean to the population mean?**The standard error (SE)** of a statistic tell you roughly how far off the sample mean will be from its expected value (population mean). It indicates how much the sample mean is expected to vary from the true population mean.$$ \text{SE} (\bar{x_n})= \frac{\sigma}{\sqrt{n}} $$ {#eq-standardErrorMean}where n is the sample size. **We can actually use this formula to decide what is the sample size we need to use for a desired accuracy in our statistic**.It plays a similar role as the *standard deviation* for one observation from the population mean, remember that the Standard deviation measures the dispersion of individual data points around the mean of a dataset. It tells you how spread out the values in a dataset are. In summary, the standard error of an estimate is the standard deviation of the sampling distribution of an estimate.::: exercise-box**Exercise:***A town has 10,000 registered voters, of whom 6,000 are voting for the Democratic party. A survey organization is taking a sample of 100 registered voters (assume sampling with replacement). The percentage of Democratic voters in the sample will be around \_\_\_\_\_, give or take \_\_\_\_. (You may use the fact that the standard deviation is about 0.5)*The percentage of Democratic voters in the sample is equal to the mean of the sample, and so the "around" value is the **expected value of the sample mean**, which in turn is equal to the population mean, 0.60 or 60%. Similarly, the "give or take" value is the *standard error* (SE) of the mean, which is equal to $\frac{\sigma}{\sqrt{n}}$ , where σ is the standard deviation of the population and n is the size of the sample. 
We're told that σ is about 0.5 and n is 100, so the standard error of the mean is 0.05 or 5%.:::### Expected value and standard error for the sumIf we are interested in the sum of n draws $S_n$, the sum and the average are related by $S_n = n\bar{x_n}$ so the standard error of the sum: $$SE(S_n)=\sqrt{n}\sigma$$ {#eq-standardErrorSum}which tells us that the variability of the sum of n draws increases at the rate of $\sqrt{n}$ so **while increasing the sample size reduces the standard error of the average, it actually increases the standard error for the sum.**::: exercise-boxExercise1 (Expected Value):*You solicit 100 pledges for a charitable organization. Each pledge is equally likely to be \$10, \$50, or \$100. You may use the fact that the standard deviation of the three amounts \$10, \$50 and \$100 is \$37.**What is the expected value of the sum of the 100 pledges?*The expected value of the sum of a sample is nμ, where n is the size of the sample and μ is the mean of the population. Here we're told that n is 100 and μ must be the average of \$10, \$50, and \$100, which is \$53.33, so the expected value of the sample sum is \$5333.:::::: exercise-boxExercise2 (Standard error for the sum)*You solicit 100 pledges for a charitable organization. Each pledge is equally likely to be \$10, \$50, or \$100. You may use the fact that the standard deviation of the three amounts \$10, \$50 and \$100 is \$37.**What are the chances that the 100 pledges total more than \$5,700 ?*The standard error of the sample sum is $SE(S_n)=\sqrt{n}\sigma$ where n is the sample size and σ is the standard deviation of the population.Since n is 100 and we're told that the standard deviation of the population is \$37, we know the standard error of the sample sum is \$370.Using the normal approximation to the sampling distribution, this tells us that \$5700 is roughly one standard deviation (5700 - 5333 = 366) above the mean of the sampling distribution. 
From the **empirical rule**, we know that one half of 68% of the data lies within one standard deviation to the right of the mean, so that 50 + (1/2)68 = 84% of the data lies to the left of \$5700. That is, 16% percent of the data lies to right of \$5700.:::### Expected value and standard error for percentages.For percentages, the expected value is $$E = \mu * 100\%$$ {#eq-expectedValuePerc} and the Standard Error $$SE=\frac{\sigma}{\sqrt{n}} *100\%$$ {#eq-standardErrorPerc}All the above formulas are for sampling with replacement, but they are still approximately true when sampling without replacement if the sample size is much smaller than the size of the population.------------------------------------------------------------------------::: exercise-boxExercise:*We toss a coin 100 times. How many tails do you expect to see?*we assing tails a value of 1, and heads a value of 0.p(1)=0.5, p(0)=0.5Number of expected tails = $E(sum) = 100 * \mu$and $\mu = 0*p(0)+1*p(1)$ , so $\mu =0.5$ and $E= 100*0.5$ =50 tails.The Standard Error will be $SE(sum) = \sqrt{100}*\sigma$to calculate sigma, we know that $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}$ where:- ($\sigma$) is the standard deviation,- ($N$) is the total number of observations,- ($X_i$) represents each individual observation,- ($\mu$) is the mean of the observations.In our exercise, without having to toss the coing 100 times we calculate it like this:$\sigma =\sqrt{(x_1-\mu)^{2} * p(1) +(x_2-\mu)^{2}*p(2)}$$\sigma =\sqrt{(0-0.5)^{2} * 0.5 +(1-0.5)^{2}*0.5}=0.5$$\sigma^2 ={(0-0.5)^{2} * 0.5 +(1-0.5)^{2}*0.5}=0.25$$SE(sum) = \sqrt{100}\sigma^2=10*0.25=5$so our answer is that we expect to see 50 tails, give or take 5 tails.and in percentages we would expect to see E= 0.5\*100 =50% tails with an standard error of SE =0.5/10\*100 = 5%in r code, an example of the same problem with n=200 tosses. 
See the difference in the percentages.```{r}# Define the probabilities p <-c(0.5, 0.5) # Define the values (0 for heads, 1 for tails) x <-c(0, 1) # Calculate the mean (μ) mu <-sum(x * p) # Calculate the variance (σ^2) variance <-sum((x - mu)^2* p) # Calculate the standard deviation (σ) sigma <-sqrt(variance) # Number of coin tosses n <-200# Calculate the standard error (SE) SE <-sqrt(n) * sigma cat("Expected percentage of tails:", n * mu, "\n") cat("Standard error:", SE, "\n") cat("Expected number of tails:", mu*100, " % \n") cat("Expected standard error percentage:", sigma /sqrt(n), "% \n")```:::------------------------------------------------------------------------## The sampling distributionIf we toss a fair coin 100 times, we can get any number of tails from 0 to 100. How likely is each outcome?The number of tails has the binomial distribution with `n=100` and `p=0.5` where `success = tails`. So if the statistic of interest is $S_n=number\ of \ tails$ , then $S_n$ is a random variable whose probability histogram is given by the binomial distribution. This is called the *sampling distribution of the statistic* $S_n$. The sampling distribution provides more detailed information about the chance properties of $S_n$ than the summary numbers given by the expected value and the standard error alone.There are three histograms to take into consideration:### Probability histogramThe probability histogram for producing the data. 
Since both tails and heads have the same probability, our histogram would be two columns with the same height.```{r, fig.align='center'}#| echo: falseheads =rep(0,50)tails =rep(1,50)theoreticalP =c(heads,tails)theoreticalP# Convert the vector to a data framedf <-data.frame(theoreticalP)# Create the histogramgg <-ggplot(df, aes(x = theoreticalP)) +geom_histogram(binwidth =0.5, fill ="blue", color ="black") +labs(title ="Histogram of Theoretical Probabilities", x ="Value", y ="Frequency")# Print the plotprint(gg)```### Empirical histogramIf we toss the coins, we will have am empirical histogram```{r, fig.align='center', echo=FALSE}# Set the number of coin tossesnum_tosses <-100# Simulate the coin tossescoin_tosses <-sample(c(0, 1), size = num_tosses, replace =TRUE, prob =c(0.5, 0.5))# Create a dataframedf <-data.frame(toss =1:num_tosses, result = coin_tosses)# Create the histogramgg <-ggplot(df, aes(x = result)) +geom_histogram(binwidth =0.5, fill ="blue", color ="black") +labs(title ="empirical histogram", x ="Value", y ="Frequency")# Print the plotprint(gg)```### Probability histogram of the sampleAnd finally we can do the probability histogram of the statistic $s_{100}$, which shows the sampling distribution:```{r, fig.align='center' , echo=FALSE}# Set the parametersn <-100p <-0.5# Calculate the probabilities for each number of tails (0 to 100)tails <-0:nprobabilities <-dbinom(tails, size = n, prob = p)# Create a dataframedf <-data.frame(tails = tails, probability = probabilities)# Create the histogramgg <-ggplot(df, aes(x = tails, y = probability)) +geom_bar(stat ="identity", fill ="blue", color ="black") +labs(title ="Probability Distribution of Number of Tails in 100 Coin Tosses",x ="Number of Tails",y ="Probability") +theme_minimal()# Print the plotprint(gg)```# Central limit theorem (CLT)The Central Limit Theorem is one of the most important results in statistics, basically saying that with sufficiently large sample sizes, the sample average approximately 
follows a normal distribution. This underscores the importance of the normal distribution, as well as most of the methods commonly used which make assumptions about the data being normally distributed.Let's stop and think about what it means for the sample average to have a distribution. Imagine going to the store and buying a bag of your favorite brand of chocolate chip cookies. Suppose the bag has 24 cookies in it. Will each cookie have the exact same number of chocolate chips in it? It turns out that if you make a batch of cookies by adding chips to dough and mixing it really well, then putting the same amount of dough onto a baking sheet, the number of chips per cookie closely follows a Poisson distribution. (In the limiting case of chips having zero volume, this is exactly a Poisson process.) Thus we expect there to be a lot of variability in the number of chips per cookie. We can model the number of chips per cookie with a Poisson distribution. We can also compute the average number of chips per cookie in the bag. For the bag we have, that will be a particular number. But there may be more bags of cookies in the store. Will each of those bags have the same average number of chips? If all of the cookies in the store are from the same industrial-sized batch, each cookie will individually have a Poisson number of chips. So the average number of chips in one bag may be different from the average number of chips in another bag. Thus we could hypothetically find out the average number of chips for each bag in the store. And we could think about what the distribution of these averages is, across the bags in the store, or all the bags of cookies in the world. 
It is this distribution of averages that the central limit theorem says is approximately a normal distribution, with the same mean as the distribution for the individual cookies, but with a standard deviation divided by the square root of the number of samples in each average (i.e., the number of cookies per bag).

When sampling with replacement and $n$ is large, the sampling distribution of the sample average approximately follows the normal curve.

For example, going back to the example of the online gambling where we had a probability of winning a small prize of $p=0.2$, we can see how the probability histogram for the binomial distribution approximates the normal distribution more and more closely as we increase the number of trials $n$.

```{r, fig.align='center', echo=FALSE, fig.width=12}
# Set the parameters
n <- 1
p <- 0.2

# Calculate the probabilities for each number of successes (0 to n)
success <- 0:n
probabilities <- dbinom(success, size = n, prob = p)
df <- data.frame(success = success, probability = probabilities)

gg <- ggplot(df, aes(x = success, y = probability)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(title = "Probability histogram for the binomial n=1",
       x = "Number of successes with n=1",
       y = "Probability") +
  theme_minimal()

n <- 10
success <- 0:n
probabilities <- dbinom(success, size = n, prob = p)
df <- data.frame(success = success, probability = probabilities)

gg2 <- ggplot(df, aes(x = success, y = probability)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(title = "Probability histogram for the binomial n=10",
       x = "Number of successes with n=10",
       y = "Probability") +
  theme_minimal()

n <- 100
success <- 0:n
probabilities <- dbinom(success, size = n, prob = p)
df <- data.frame(success = success, probability = probabilities)

gg3 <- ggplot(df, aes(x = success, y = probability)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(title = "Probability histogram for the binomial n=100",
       x = "Number of successes with n=100",
       y = "Probability") +
  theme_minimal()

# Combine the three plots side by side (uses the patchwork package)
combinedPlots <- gg + gg2 + gg3
combinedPlots
```

That means that we can use the normal approximation to compute probabilities. The key point of the theorem is that the sampling distribution of the statistic is normal no matter what the population histogram looks like.

The mathematical theory behind this: the CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average $\bar{y}$ of a random sample follows a normal distribution centered at the population average $\mu$ and with standard deviation equal to the population standard deviation $\sigma$ divided by the square root of the sample size $N$. We refer to the standard deviation of the distribution of a random variable as the random variable's standard error. We have seen this formula before when studying the Standard Error (@eq-standardErrorMean):

$$ \text{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}} $$

This tells us that when we take larger samples and plot their averages in a histogram, the spread of the histogram becomes smaller (the averages concentrate closer to the population average), as we saw in the histograms above.

**Standardized sample mean**

This implies that if we take many samples of size $N$, then the quantity

$$\frac{\bar{Y} - \mu}{\sigma_Y/\sqrt{N}}$$ {#eq-standardizedSampleMean}

is approximately normal, centered at 0 and with standard deviation 1, where:

- $\bar{Y}$: the sample mean, the average of your sample data.
- $\mu$: the population mean, the average of the entire population from which the sample is drawn.
- $\sigma_Y$: the population standard deviation, which measures the spread of the population data.
- $N$: the sample size, the number of observations in your sample.

The standardized sample mean is a way to transform the
sample mean into a standard normal distribution (mean 0, standard deviation 1). Standardizing the sample mean by subtracting the population mean and dividing by the standard error $\sigma_Y/\sqrt{N}$ allows us to compare it to the standard normal distribution. This is useful because the properties of the normal distribution are well understood and widely applicable in statistical inference.

**Central limit theorem in practice.** When we don't have access to the population data, the central limit theorem allows us to use the normal approximation to compute $p$-values.

We are going to use the mice bodyweight data for our example. We will work only with female mice because the bodyweights of female and male mice are different. We calculate the mean and the standard deviation of the control and treatment groups, and those will be the parameters of our population. The CLT tells us that when the samples are large, the random variable is normally distributed, so we can compute $p$-values using the function `pnorm`.

```{r miceexampleCLT}
dat <- read.csv("data/mice_pheno.csv")
table(dat$Sex, dat$Diet)

controlPopulation <- filter(dat, Sex == "F" & Diet == "chow") %>%
  dplyr::select(Bodyweight) %>% unlist
hfPopulation <- filter(dat, Sex == "F" & Diet == "hf") %>%
  dplyr::select(Bodyweight) %>% unlist

mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)

sd_hf <- sd(hfPopulation)
sd_hf
sd_control <- sd(controlPopulation)
sd_control
```

Remember that in practice we do not get to compute these population parameters. These are values we never see; in general, we have to estimate them from samples.

As we described, the CLT tells us that for large $N$ each sample average is approximately normal, with mean equal to the population mean and standard error equal to the population standard deviation divided by $\sqrt{N}$. We mentioned that a rule of thumb is that $N$ should be 30 or more. However, that is just a rule of thumb, since the accuracy of the approximation depends on the population distribution.
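As a small sketch of how `pnorm` turns a sample average into a $p$-value (using made-up numbers, not the mice data): if we pretend the population mean and standard deviation are known, the CLT lets us convert an observed average into a tail probability:

```{r}
# Hypothetical example (illustration only): population with
# mean mu = 25 and sd sigma = 4; a sample of n = 30 gives
# an observed sample mean of 26.5
mu <- 25; sigma <- 4; n <- 30; xbar <- 26.5

# CLT: xbar ~ Normal(mu, sigma / sqrt(n)) under the null
se <- sigma / sqrt(n)
z  <- (xbar - mu) / se

# Two-sided p-value from the normal approximation
p_value <- 2 * (1 - pnorm(abs(z)))
p_value
```

The same recipe is what the mice example below applies with estimated parameters in place of the known ones.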
Here we can actually check the approximation, and we do that for various values of $N$.

We are going to create 10,000 samples of $N$ mice for $N = 3, 12, 25$ and $50$, and calculate the difference between the treatment and control bodyweight means. We then calculate the t-statistic for each sample size (the t-statistic is explained later).

Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1. This ratio is what we call the t-statistic:

$$t = \frac{\bar{x} - \mu}{SE}$$ {#eq-tstatistic}

where $SE = \frac{s}{\sqrt{n}}$.

In the specific case of our experiment (comparing the means of two samples) this translates into:

$$ t = \frac{\bar{y} - \bar{x}}{SE} = \frac{\bar{y} - \bar{x}}{\sqrt{\frac{s_y^2}{n} + \frac{s_x^2}{n}}} $$

Finally, we will plot a QQ plot for each sample size against the normal distribution to see the fit:

```{r, fig.align='center', fig.width=9, fig.height=9}
Ns <- c(3, 12, 25, 50)
B <- 10000 # number of simulations

## function to compute a t-stat
computetstat <- function(n) {
  y <- sample(hfPopulation, n)
  x <- sample(controlPopulation, n)
  (mean(y) - mean(x)) / sqrt(var(y) / n + var(x) / n)
}

res <- sapply(Ns, function(n) {
  replicate(B, computetstat(n))
})

par(mfrow = c(2, 2))
for (i in seq(along = Ns)) {
  qqnorm(res[, i], main = Ns[i])
  qqline(res[, i], col = 2)
}
```

So we see that for $N=3$ the CLT does not provide a usable approximation. For $N=12$ there is a slight deviation at the higher values, although the approximation appears useful. For 25 and 50 the approximation is good. This simulation only shows that $N=12$ is large enough in this case, not in general. The further the population distribution is from the normal distribution, the larger the sample size we will require. For populations that are already normally distributed, even small sample sizes will result in sample means that are approximately normally distributed.
For populations that are not normally distributed, larger sample sizes are required for the sample means to approximate a normal distribution.

We will see later that we don't usually calculate our $p$-values like this using the CLT; instead we will use a *t-test*, which gives a slightly different result. This is to be expected because our CLT approximation treated the denominator of the t-statistic as practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is itself a random variable. The smaller the sample size, the more the denominator varies.

::: {.callout-orange appearance="simple" icon="false"}
**When does the central limit theorem apply?** For the normal approximation to work, the key requirements are:

- We sample with replacement, or we simulate independent random variables from the same distribution. (If the sample size is much smaller than the population size, then sampling with replacement and sampling without replacement are approximately the same, so this also applies.)
- The statistic of interest is a sum, average or percentage.
- The sample size is large enough. The more skewed the population histogram is, the larger the required sample size $n$. If there is no strong skewness, then $n=15$ is sufficient.
:::

::: exercise-box
Exercise:

Suppose we are interested in the proportion of times we see a 6 when rolling $n=100$ dice. This is a random variable which we can simulate with `x = sample(1:6, n, replace=TRUE)`, and the proportion we are interested in can be expressed as an average: `mean(x == 6)`. Because the die rolls are independent, the CLT applies.

We want to roll $n$ dice 10,000 times and keep these proportions. This random variable (proportion of 6s) has mean $p=1/6$ and variance $\frac{p(1-p)}{n}$.
So according to the CLT:

$$z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$$

$z$ should be normal with mean 0 and SD 1. Perform the simulation, and report what proportion of times $z$ was larger than 2 in absolute value (the CLT says it should be about 0.05).

```{r, fig.align='center', fig.width=10}
n <- 100
N <- 10000
p <- 1/6
SE <- sqrt(p * (1 - p) / n)

set.seed(1)
res <- replicate(N, sample(1:6, n, replace = TRUE))
prop <- apply(res, 2, function(x) mean(x == 6))

par(mfrow = c(1, 2))
hist(prop, freq = TRUE)

z_scores <- (prop - p) / SE
plot(z_scores, dnorm(z_scores))

proportion_greater_than_2 <- mean(abs(z_scores) > 2)
print(proportion_greater_than_2)
```
:::

# Significance test vs Confidence intervals

There are two fundamental sorts of questions we might ask of our sample data:

- How can we estimate a parameter while taking into account the variability of the data in an honest way?
- How can we know if our data supports a specific conclusion, given its inherent variability?

The first of these questions leads to the idea of a **confidence interval**, where we specify not only an estimate of a parameter but also a reasonable margin of error. The second gives rise to **significance testing**.

::: {#confidenceIntvsSignifTest style="border: 2px solid #f0ad4e; border-radius: 8px; background-color: #fff3cd; padding: 10px;"}
- Use a confidence interval if you wish to estimate a parameter from a sample in a way that describes not only the observed mean but also the uncertainty surrounding it.
- Use a significance test if there is one particular value of interest, for instance representing whether a piece of machinery is properly calibrated.
:::

Significance tests and confidence intervals are two sides of the same coin. *Significance tests* consider specific individual values, while *confidence intervals* give more general information about the parameter of interest. The latter are more flexible and are generally easier to interpret.
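The "two sides of the same coin" relationship can be checked directly with `t.test` on simulated data (the numbers below are simulated for illustration, not one of this book's datasets): at matching levels, the 95% confidence interval excludes the null value exactly when the two-sided $p$-value falls below 0.05.

```{r}
# Simulated sample of 40 values (illustration only)
set.seed(7)
x <- rnorm(40, mean = 0.6, sd = 1.5)

res <- t.test(x, mu = 0)  # test H0: mu = 0
ci  <- res$conf.int

# These two statements always agree at matching levels:
rejects  <- res$p.value < 0.05
excludes <- ci[1] > 0 | ci[2] < 0
c(rejects = rejects, excludes = excludes)
```

Whatever the data, `rejects` and `excludes` come out identical, which is the duality the text describes.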
Consider a significance test only when there is a single parameter value of interest which you could identify before looking at the data.

# Confidence intervals

A confidence interval is a way of reporting an estimate of a parameter that includes information about how much variability could reasonably be expected due to random chance. A confidence interval for the mean of a quantitative variable has the form

$$\mu = \bar{x} \pm ME$$

where $\mu$ is the population mean, $\bar{x}$ is the observed sample mean and $ME$ is a **margin of error.**

## Manual calculation

A confidence interval gives a range of plausible values for a population parameter. Usually the confidence interval is centered at an estimate for $\mu$ which is an average. Since the central limit theorem applies for averages, the confidence interval has a simple form:

$$CI = estimate \pm ME$$

where **ME is the margin of error**, which can be decomposed like this:

$$CI = estimate \pm z \cdot SE = \bar{x} \pm z \cdot SE$$ {#eq-confidenceInterval}

where SE is the *standard error* of the sample. If the estimate is an average or a percentage, then $SE = \frac{\sigma}{\sqrt{n}}$, as we saw in (@eq-standardErrorMean). And $z$ is the $z$-score corresponding to the desired confidence level.
For example, if we would like to have a 95% confidence level, our $z = 1.96$. This comes from finding the value of $z$ between which 95% of the standard normal distribution lies:

```{r, fig.align='center', echo=FALSE}
x <- seq(-4, 4, length = 100)

# Calculate the corresponding y values for the standard normal distribution
y <- dnorm(x)

# Plot the bell curve
plot(x, y, type = "l", lwd = 2, col = "blue", xlab = "Z", ylab = "Density",
     main = "Standard Normal Distribution")

# Highlight the area between -1.96 and 1.96
polygon(c(-1.96, seq(-1.96, 1.96, length = 100), 1.96),
        c(0, dnorm(seq(-1.96, 1.96, length = 100)), 0),
        col = rgb(0.1, 0.1, 0.9, 0.2), border = NA)

# Add vertical lines at z = ±1.96
abline(v = 1.96, col = "red", lwd = 2, lty = 2)
abline(v = -1.96, col = "red", lwd = 2, lty = 2)

# Add annotation at z = 1.96
text(1.96, dnorm(1.96), labels = "z = 1.96", pos = 4, col = "red")

# Add annotation for 95% confidence interval
text(0, 0.1, labels = "95%", pos = 3, col = "blue")
```

To get to that $z$ value we use tables, or we can use the quantile function `qnorm`:

```{r}
# Calculate the z-score for a 95% confidence interval
z_score <- qnorm(0.975)
z_score
```

The value `0.975` is used because for a 95% confidence interval you need to capture the central 95% of the distribution, leaving 2.5% in each tail.
Therefore, you look up the 97.5th percentile (0.975) to get the $z$-score.

For a 90% confidence interval we are looking for the $z$ value of the standard normal distribution where 90% of the data falls in our central range and 10% outside, so 5% in each tail:

```{r}
# Calculate the z-score for a 90% confidence interval
desiredConfidence <- 90
tails <- (100 - desiredConfidence) / 2
percentileOfinterest <- desiredConfidence + tails
z_score <- qnorm(percentileOfinterest / 100)
z_score
```

For a 99% confidence level:

```{r}
# Calculate the z-score for a 99% confidence interval
desiredConfidence <- 99
tails <- (100 - desiredConfidence) / 2
percentileOfinterest <- desiredConfidence + tails
z_score <- qnorm(percentileOfinterest / 100)
z_score
```

::: exercise-box
Confidence Intervals for Coin Flips

**Example 1: Flipping a Coin 100 Times**

In this example of flipping a coin 100 times, we observed 44 heads, resulting in the following 95% confidence interval for $p$:

$$CI = \hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where:

- $\hat{p}$ is the observed proportion of heads, here $\hat{p} = \frac{44}{100} = 0.44$
- $z$ is the critical value from the standard normal distribution for a 95% confidence level (approximately 1.96)
- $n$ is the number of trials (100 in this case)

```{r}
n1 <- 100
x1 <- 44
p_hat1 <- x1 / n1
z <- 1.96

# Confidence interval calculation
(ci_lower1 <- p_hat1 - z * sqrt(p_hat1 * (1 - p_hat1) / n1))
(ci_upper1 <- p_hat1 + z * sqrt(p_hat1 * (1 - p_hat1) / n1))
```

p: (0.343, 0.537)

From this, we concluded that it is plausible that the coin may be fair, because $p = 0.5$ is in the interval.

**Example 2: Flipping a Coin 100,000 Times**

Suppose instead that we flipped the coin 100,000 times, observing 44,000 heads (the same percentage of heads as before).
Then, using the method just presented, the 95% confidence interval for $p$ is:

```{r}
# Example 2: Flipping a Coin 100,000 Times
n2 <- 100000
x2 <- 44000
p_hat2 <- x2 / n2

# Confidence interval calculation
(ci_lower2 <- p_hat2 - z * sqrt(p_hat2 * (1 - p_hat2) / n2))
(ci_upper2 <- p_hat2 + z * sqrt(p_hat2 * (1 - p_hat2) / n2))
```

p: (0.437, 0.443)

*Is it reasonable to conclude that this is a fair coin with 95% confidence?* Based on this much narrower confidence interval, it appears unlikely that the coin is fair, since $p = 0.5$ is not within the interval.

```{r, echo=FALSE}
# Data for plotting
data <- data.frame(
  Example = c("100 Flips", "100,000 Flips"),
  p_hat = c(p_hat1, p_hat2),
  CI_Lower = c(ci_lower1, ci_lower2),
  CI_Upper = c(ci_upper1, ci_upper2)
)

# Plot the confidence intervals
ggplot(data, aes(x = Example, y = p_hat)) +
  geom_point() +
  geom_errorbar(aes(ymin = CI_Lower, ymax = CI_Upper), width = 0.2) +
  labs(title = "95% Confidence Intervals for Coin Flips",
       x = "Example",
       y = "Proportion of Heads") +
  theme_minimal()
```
:::

**Estimating the SE with the bootstrap principle**

We still have a problem when calculating the confidence interval this way: we need to know the standard deviation of the population, $\sigma$, and we usually don't. But the [bootstrap principle](#bootstrap) states that we can replace $\sigma$ by its sample version $s$ and still get an approximately correct confidence interval.

::: exercise-box
Example: We poll 1000 likely voters and find that 58% approve of the way the president handles his job.

$SE = \frac{\sigma}{\sqrt{n}} \times 100$ (in percentage points), where $\sigma = \sqrt{p(1-p)}$ and $p$ is the proportion of **all voters** who approve. We don't know $p$, but the bootstrap principle tells us that we can replace $\sigma$ by $s$, plugging in the values from our survey: $s = \sqrt{0.58(1-0.58)} = 0.49$.

So a 95% confidence interval for $p$ is:

$$58\% \pm 1.96 \times \frac{0.49}{\sqrt{1000}} \times 100$$

which is approximately \[54.9%, 61.1%\].
:::

The width of the confidence interval is determined by
$z$ and the standard error SE. To reduce that *margin of error* we have two options: increase the sample size or decrease the confidence level. The sample size appears under a square root, which means that to cut the width of the confidence interval in half we need four times the sample size, and to reduce it by a factor of 10 we would need 100 times the sample size.

## Getting the confidence intervals from a test result

```{r}
# simulate our survey results
success <- rep(1, 580)
failure <- rep(0, 420)
mydata <- c(success, failure)

# do a t.test
testResult <- t.test(mydata)
confInt95_low <- testResult$conf.int[1]
confInt95_upp <- testResult$conf.int[2]
confInt95_upp
cat('confidence interval = [', confInt95_low, confInt95_upp, ']')

margin_of_error <- (confInt95_upp - confInt95_low) / 2
cat('margin of error: ', margin_of_error)
```

For example, in our optical dataset, if we run a t.test over the eye difference variable we can get the confidence interval for a confidence level of 95%:

```{r, fig.align='center'}
file1 <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file1)

testResult <- t.test(optical_sample$eye_difference)
confInt95_low <- testResult$conf.int[1]
confInt95_upp <- testResult$conf.int[2]
margin_of_error <- (confInt95_upp - confInt95_low) / 2

ggplot(optical_sample, aes(x = eye_difference)) +
  geom_histogram() +
  geom_vline(xintercept = mean(optical_sample$eye_difference), linetype = "dashed") +
  geom_vline(xintercept = confInt95_low, linetype = "dotted", color = "green") +
  geom_vline(xintercept = confInt95_upp, linetype = "dotted", color = "green")
```

The margin of error will be:

```{r}
margin_of_error <- (confInt95_upp - confInt95_low) / 2
margin_of_error
```

Now we can calculate the confidence intervals for the 90% and 99% confidence levels and see their differences in a graph:

```{r, fig.align='center'}
# confidence level 90%
testResult <- t.test(optical_sample$eye_difference, conf.level = .90)
confInt90_low <- testResult$conf.int[1]
confInt90_upp <- testResult$conf.int[2]

# confidence level 99%
testResult <- t.test(optical_sample$eye_difference, conf.level = .99)
confInt99_low <- testResult$conf.int[1]
confInt99_upp <- testResult$conf.int[2]

ggplot(optical_sample, aes(x = eye_difference)) +
  geom_histogram() +
  geom_vline(xintercept = mean(optical_sample$eye_difference), linetype = "dashed") +
  geom_vline(xintercept = confInt90_low, linetype = "dotted", color = "orange") +
  geom_vline(xintercept = confInt90_upp, linetype = "dotted", color = "orange") +
  geom_vline(xintercept = confInt95_low, linetype = "dotted", color = "green") +
  geom_vline(xintercept = confInt95_upp, linetype = "dotted", color = "green") +
  geom_vline(xintercept = confInt99_low, linetype = "dotted", color = "yellow") +
  geom_vline(xintercept = confInt99_upp, linetype = "dotted", color = "yellow")
```

Three factors feed into the size of the margin of error:

1. The **confidence level of the interval**. That is, the proportion of the time our interval would correctly capture the parameter of interest. Higher confidence requires a larger margin of error.
2. The **spread of the observations** in the data set. More spread in the data implies more sampling variability and therefore a larger margin of error.
3. The **size of the sample**.

::: exercise-box
Exercises:

Using the pm_survey data set,

1. Find a 95% confidence interval for the mean salary of the project managers in this population. Interpret the results in plain language. Would a 99% confidence interval be wider or narrower?
2. Find a 95% confidence interval for the mean non-salary compensation of project managers in this population.
Why is the margin of error different than in the first problem?

```{r}
file <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file)

kable(head(pm_survey)) %>%
  kable_styling(latex_options = "scale_down") %>%
  landscape()

as.data.frame(report(pm_survey$annual_salary))
t.test(pm_survey$annual_salary)
```

The confidence interval at the (default) 95% level is `r t.test(pm_survey$annual_salary)$conf.int`, which means in plain language that if we were to repeat the sampling and the test many times, this procedure would capture the true value of the parameter 95% of the time. Although not strictly true, we can loosely say that there's a 95% chance that this confidence interval includes the true population mean.

For the second exercise:

```{r}
pm_survey <- pm_survey |>
  mutate(other_monetary_comp = replace_na(other_monetary_comp, 0))

as.data.frame(report(pm_survey$other_monetary_comp))
t.test(pm_survey$other_monetary_comp)
```

Our confidence interval is `r t.test(pm_survey$other_monetary_comp)$conf.int`, and the margins of error in the two examples are:

```{r}
monetary_marginError <- (t.test(pm_survey$annual_salary)$conf.int[2] -
                           t.test(pm_survey$annual_salary)$conf.int[1]) / 2
monetary_marginError

nonMonetary_marginError <- (t.test(pm_survey$other_monetary_comp)$conf.int[2] -
                              t.test(pm_survey$other_monetary_comp)$conf.int[1]) / 2
nonMonetary_marginError
```

The difference between the two is due to the variability of the data, which we can measure using the standard deviation:

```{r}
sd(pm_survey$annual_salary)
sd(pm_survey$other_monetary_comp)
```
:::

### Interpreting confidence intervals: practical example

We are going to calculate the confidence interval for 100 samples of 30 observations drawn from the same population and plot them on a graph with a line marking the true mean.
We will see how some of them (for a 95% confidence level it should be around 5% of the time) still miss the population parameter:

```{r, fig.align='center', fig.width=6}
# calculate the sample mean and confidence interval for 100 samples
low <- numeric()
high <- numeric()

for (n in 1:100) {
  sample <- slice_sample(optical, n = 30)
  test <- t.test(sample$eye_difference)
  low[n] <- test$conf.int[1]
  high[n] <- test$conf.int[2]
}

ci_reps <- data.frame("replicate" = 1:100, low, high)

ggplot(ci_reps, aes(x = low, xend = high, y = replicate, yend = replicate)) +
  geom_segment() +
  geom_vline(xintercept = mean(optical$eye_difference), linetype = "dashed")
```

When interpreting a confidence interval, remember that not all values in an interval are equally plausible: values near the center are more likely than those near the ends.

# Significance testing

A t-significance test is used to consider whether an individual parameter value of interest is plausible in light of sample data. The situation in which the parameter has that value is referred to as the **null hypothesis**, and the situation in which it does not is referred to as the **alternative hypothesis**. For example: could the difference between left eye value and right eye value actually be zero? Does a certain large company pay better than the national average, or could salaries there just be different through random chance?

A *test statistic* measures how far away the data in our sample are from what we would expect if the null hypothesis $H_0$ were true. The most common test statistic is the *z*-statistic. It determines how far an observed value is from the expected value, measured in standard errors. We already saw this formula before (@eq-zscore):

$$z = \frac{observed - expected}{SE}$$ {#eq-zscore3}

Observed is a statistic that is appropriate for assessing $H_0$.
For example, if we toss a coin ten times and want to know if it is a fair coin, an appropriate statistic would be the number of tails or the percentage of tails. Expected and SE are the expected value and the SE of this statistic, computed under the assumption that $H_0$ is true.

::: exercise-box
Example:

We toss a coin 10 times and get 7 tails. Using the formulas for the expected values of sums: for a binomial distribution we have $E = n \cdot p$, where $n$ is the number of repetitions and $p$ is the probability of success (0.5), and the standard error for binomial scenarios is $SE = \sqrt{np(1-p)}$.

$$expected = 10 \cdot \frac{1}{2} = 5$$

$$SE = \sqrt{10}\sqrt{\frac{1}{2} \cdot \frac{1}{2}} = 1.58$$

$$z = \frac{7-5}{1.58} = 1.27$$

By the central limit theorem, the $p$-value can be computed with the normal approximation:

```{r}
z <- 1.27

# Calculate the two-sided p-value from the normal approximation
(p_value <- (1 - pnorm(z)) * 2)
```

In other words, we have a 20.4% probability of observing a result as extreme as 7 tails in 10 tosses, given that the coin is fair.
:::

If the $z$-value is large, there is a great difference between the observed and the expected, so large values of $z$ are evidence against $H_0$. The strength of the evidence is measured by the $p$-value, or observed significance level. The $p$-value is the probability of getting a value of $z$ as extreme as or more extreme than the observed one, assuming the null hypothesis is true. As we can see in the graph below, as $z$ becomes larger there is less probability of finding data outside its limits, so the $p$-value becomes smaller. We multiply the $p$-value by 2 because we have to consider the two tails of the distribution for this experiment.

The $z$-score tells you how far an observed value is from the expected value in terms of standard errors.
The $p$-value tells you the probability of observing a test statistic as extreme as, or more extreme than, the observed statistic, under the null hypothesis.

```{r, fig.align='center', echo=FALSE}
x <- seq(-4, 4, length = 100)

# Calculate the corresponding y values for the standard normal distribution
y <- dnorm(x)

# Plot the bell curve
plot(x, y, type = "l", lwd = 2, col = "blue", xlab = "", ylab = "Density",
     main = "Standard Normal Distribution", xaxt = 'n')

# Highlight the area before -1.27
polygon(c(-4, seq(-4, -1.27, length = 100), -1.27),
        c(0, dnorm(seq(-4, -1.27, length = 100)), 0),
        col = rgb(0.1, 0.1, 0.9, 0.2), border = NA)

# Highlight the area after 1.27
polygon(c(1.27, seq(1.27, 4, length = 100), 4),
        c(0, dnorm(seq(1.27, 4, length = 100)), 0),
        col = rgb(0.1, 0.1, 0.9, 0.2), border = NA)

# Add vertical lines at z = ±1.27
abline(v = 1.27, col = "red", lwd = 2, lty = 2)
abline(v = -1.27, col = "red", lwd = 2, lty = 2)

# Add annotations at z = ±1.27
text(1.27, dnorm(1.27), labels = "z = 1.27", pos = 4, col = "red")
text(-1.27, dnorm(-1.27), labels = "z = -1.27", pos = 4, col = "red")
```

The **result of a significance test is a** $p$-value, which measures how plausible the value of interest is given the sample data. A $p$-value close to zero indicates the value isn't compatible with the data. As a general rule, if `p < 0.05`, the value can be considered questionable.

Ideally, you should identify a value of interest (the technical term is *null hypothesis*) before collecting the data. While this isn't always done in practice, doing so helps prevent wrong conclusions that come from the shotgun effect:

> If you run a test because you observed an interesting pattern in your sample data, the overall chance of encountering a significant result increases.
Data is like firing a shotgun with many pellets: some are bound to hit the target purely due to randomness. If you formulate your null hypothesis based on a characteristic observed in your sample, you are likely to find significance even if that characteristic is not present in the overall population and appeared just by chance.

While a $p$-value might tell you that a parameter is different from a hypothesized value, it won't ever say how different it is or whether that difference is important.

A *t-significance test* is appropriate for quantitative data under the exact same circumstances as a t-confidence interval: that is, unless the sample is very small and the data is highly non-symmetric. If the data is relatively symmetric and there are no extreme outliers, `n = 10` is usually enough. Regardless of the shape of the data, `n = 30` is a safe threshold.

The value of $\mu$ (mu) is the value we want to test against (our null hypothesis):

```{r}
file1 <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file1)

testResult <- t.test(optical_sample$eye_difference, mu = 0)
testResult$p.value
```

The $p$-value expresses the probability of observing values like those in our sample if the eye difference of the population were in fact 0. The smaller the $p$-value, the smaller the probability. Usually we take the threshold of 5%, i.e. $p$-value < 0.05, to reject the null hypothesis, as it means it is unlikely that the mean observed in our sample comes from a population whose true mean is 0.

We can change the null hypothesis to whatever value we want to test against. For example, can we say that the average age of the population is NOT 59?

```{r}
testResult <- t.test(optical_sample$Age, mu = 59)
testResult$p.value
```

Our $p$-value is `{r}testResult$p.value`, which does not allow us to reject the null hypothesis.

::: {#SignificanceTestQuickFacts .callout-orange}
Quick Facts

1. A $p$-value below 0.05 doesn't mean there's a 5% chance the null hypothesis is true.
Instead, it means that if the null were true, there would be a 5% chance of observing data as extreme as, or more extreme than, what was observed in the sample.
2. Statistical power, the ability to detect a true effect, is often overlooked. A study can have a high chance of missing real effects (Type II error) even if its $p$-value threshold is stringent.
3. In proportional analysis, Simpson's Paradox can occur, where a trend seen in several groups reverses when the groups are combined. It underscores the importance of scrutinizing aggregated data.
:::

### One-sided vs two-sided tests

One has to carefully consider whether the alternative should be two-sided (we consider the values to the left of $-z$ and to the right of $z$) or one-sided (only the values on one side). With a two-sided test, the $p$-value gets doubled. It is not OK to change the alternative afterwards in order to get the $p$-value under 5%.

Deciding whether to use a one-sided or two-sided t-test depends on the hypothesis you want to test. Here are some guidelines to help you make that decision.

Use a **one-sided t-test** when you have a specific direction in mind for your hypothesis. This means you are testing whether the mean is either greater than or less than a certain value, but not both.

**Examples**:

- **Greater than**: You want to test if the mean score of a new teaching method is greater than the mean score of the traditional method.
- **Less than**: You want to test if the mean time to complete a task using a new software is less than the mean time using the old software.

Use a **two-sided t-test** when you are interested in any difference from the specified value, regardless of direction.
This means you are testing whether the mean is different from a certain value, without specifying the direction of the difference.

**Examples**:

- You want to test if the mean weight of a sample of apples is different from a known standard weight.
- You want to test if the mean score of a new drug is different from the mean score of a placebo.

How to Decide

1. **Define Your Research Question**: Clearly state what you are trying to find out.
2. **Determine the Direction of Interest**: Decide if you are only interested in deviations in one direction (one-sided) or in both directions (two-sided).
3. **Formulate Your Hypotheses**: Based on your research question and direction of interest, formulate your null and alternative hypotheses.

**Example Scenario**

Suppose you are testing a new drug and want to know if it has a different effect on blood pressure compared to a placebo. If you only care whether the drug lowers blood pressure, you would use a one-sided test. If you care whether the drug either lowers or raises blood pressure, you would use a two-sided test.

```{r}
# Sample data
sample_data <- c(78, 82, 85, 90, 76, 79, 81, 77, 74, 88)

# Perform one-sided t-test
t_test_result <- t.test(sample_data, mu = 75, alternative = "greater")
print(t_test_result)

# Perform two-sided t-test
t_test_result <- t.test(sample_data, mu = 75, alternative = "two.sided")
print(t_test_result)
```

### Common errors in significance testing {#testErrors}

At the start of a significance test, the hypothesized value might be true or false. At the end, it may be rejected or not, but we still don't know for sure whether $H_0$ is true.
If the null hypothesis holds, we say that it is a *true null hypothesis*; otherwise it is a *false null hypothesis*. Altogether, there are four possible combinations of these outcomes:

+-----------------------+-----------------------------------+-------------------------------------+
|                       | $H_0$ is true                     | $H_0$ is false                      |
+======================:+:=================================:+:===================================:+
| $H_0$ is rejected     | **Type I error** -false positive- | True positive                       |
+-----------------------+-----------------------------------+-------------------------------------+
| $H_0$ is not rejected | True negative                     | **Type II error** -false negative-  |
+-----------------------+-----------------------------------+-------------------------------------+

We call the *event* of rejecting the null hypothesis when it is in fact true a *Type I error*; we call the *probability* of making a Type I error the *Type I error rate*; and we say that rejecting the null hypothesis when the $p$-value is less than $\alpha$ *controls* the Type I error rate so that it is equal to $\alpha$.

We may observe a $p$-value higher than $\alpha$ even when the null hypothesis is false. This is a Type II error. We ask ourselves: how big does $N$ have to be in order to detect that the absolute value of the difference is greater than zero? Type II error control plays a major role in designing data collection procedures before you actually see the data, so that you know the test you will run has enough sensitivity, or power. Power is 1 minus the Type II error rate, i.e. the probability that you will reject the null hypothesis when the alternative hypothesis is true.

Several aspects of a hypothesis test affect its power for a particular effect size. Intuitively, setting a lower $\alpha$ (for example, moving from 0.05 to 0.01) decreases the power of the test for a given effect size, because the null hypothesis becomes harder to reject.
This means that for an experiment with fixed parameters (i.e., with a predetermined sample size, recording mechanism, etc.), the power of the hypothesis test trades off with its Type I error rate, no matter what effect size you target.

::: {.callout-orange appearance="simple" icon="false"}
Testing a Hypothesis

Conducting a hypothesis test typically proceeds in four steps. First, we define the null and alternative hypotheses. Next, we construct a test statistic that summarizes the strength of evidence against the null hypothesis. We then compute a $p$-value that quantifies the probability of having obtained a comparable or more extreme value of the test statistic under the null hypothesis. Finally, based on the $p$-value, we decide whether to reject the null hypothesis.

Step 1: Define the Null and Alternative Hypotheses.

The null hypothesis $H_0$ is the default state of belief about the world. The alternative hypothesis $H_a$ represents something different. We can think of rejecting the null hypothesis as making a discovery about our data; but if we fail to reject it, we do not know whether that is because our sample was too small or because the null hypothesis actually holds.

Step 2: Construct the Test Statistic.

The way in which we construct our test statistic $T$ depends on the nature of the null hypothesis that we are testing.

Step 3: Compute the $p$-value.

The $p$-value is the area under the density function of the null distribution that falls to the right of the $T$ statistic calculated in the previous step; it therefore gives us the probability of observing a value as large as or larger than the $T$ statistic under the assumption that the null hypothesis holds.
In general, most commonly used test statistics follow a well-known statistical distribution under the null hypothesis, such as a normal distribution, a t-distribution, a $\chi^2$ distribution or an $F$ distribution, provided that the sample size is sufficiently large and that some other assumptions hold.

The $p$-value allows us to transform our test statistic, which is measured on some arbitrary and uninterpretable scale, into a number between 0 and 1 that can be easily interpreted.

Step 4: Decide whether to reject the null hypothesis.

Once we have computed the $p$-value, it remains for us to decide whether or not to reject the null hypothesis. If the $p$-value is sufficiently small we want to reject it, but deciding what counts as sufficiently small is up to us.
:::

### Statistical power

Statistical power is the ability of a test to detect a specified effect. A test with low power is unlikely to rule out a hypothesized value, even if that value is false; that is, it has a high probability of Type II error. Increasing the power of an inference technique requires a trade-off of one sort or another. In practice, this usually means using a larger sample. A preliminary power analysis can give decision-makers information about the study size needed to detect an effect of interest.

A very simple example of how to do that: study planners can estimate the sample size needed to achieve a given margin of error in the estimate of a population proportion using this formula:

$$n = \left( \frac{1.96}{2E} \right)^2$$

where $E$ is the required margin of error. According to this formula, a simple political poll requiring a 3% margin of error would need a sample of approximately $n = 1067$ respondents.

We are going to use our mice dataset to see how we can calculate the power of our t-test with respect to Type II errors. We consider the data in the file to be our entire population.
If we look at the difference in average weight for control vs treatment, we see that there is in fact a difference of around 9%:

```{r}
dat <- read.csv("data/mice_pheno.csv")
controlPopulation <- filter(dat, Sex == "F" & Diet == "chow") %>%
  dplyr::select(Bodyweight) %>% unlist
hfPopulation <- filter(dat, Sex == "F" & Diet == "hf") %>%
  dplyr::select(Bodyweight) %>% unlist
mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
print((mu_hf - mu_control) / mu_control * 100) # percent increase
```

Depending on our sample size, this difference will be detected by our test or not. Let's see an example with sample size 12 and $\alpha = 0.05$. We first run a single test and see that we cannot reject the null hypothesis based on the $p$-value:

```{r}
set.seed(1)
N <- 12
hf <- sample(hfPopulation, N)
control <- sample(controlPopulation, N)
t.test(hf, control)$p.value
```

If we run this experiment many times, we can get the percentage of runs in which our $p$-value actually rejects the null hypothesis:

```{r}
repetitions <- 2000
N <- 12
alpha <- 0.05
reject <- function(N, alpha = 0.05) {
  hf <- sample(hfPopulation, N)
  control <- sample(controlPopulation, N)
  pval <- t.test(hf, control)$p.value
  pval < alpha
}
rejections <- replicate(repetitions, reject(N))
mean(rejections)
```

In this case we see that about 25% of the time we correctly rejected the null hypothesis; this is the power of our test. To increase this percentage we should increase the sample size.
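The simulated power can also be cross-checked analytically with base R's `power.t.test()`. The sketch below uses illustrative effect-size numbers (a difference of 2.4 g and a common SD of 4.4 g, roughly the scale of the mice data but not read from the file), so treat the exact output as indicative only:

```{r}
# Analytic power for a two-sample t-test with n = 12 per group,
# assuming an illustrative true difference of 2.4 and SD of 4.4
analytic <- power.t.test(n = 12, delta = 2.4, sd = 4.4, sig.level = 0.05)
analytic$power

# The same function can be inverted: how many mice per group
# would be needed to reach 80% power under the same assumptions?
power.t.test(delta = 2.4, sd = 4.4, sig.level = 0.05, power = 0.8)$n
```

Because `power.t.test()` uses the noncentral t-distribution rather than simulation, it is a quick sanity check on Monte Carlo estimates like the one above.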
In the next code section we run the same simulation for several values of $N$:

```{r, fig.align='center'}
Ns <- seq(5, 50, 5)
power <- sapply(Ns, function(N) {
  rejections <- replicate(repetitions, reject(N))
  mean(rejections)
})
plot(Ns, power, type = "b")
```

There are several tools on the internet that let you calculate the power you can expect for a particular standard deviation, sample size and effect size you want to detect.

When performing a hypothesis test over a large number of data points, the sample size can significantly impact the $p$-value.

**Influence of Sample Size on the p-value**:

*Increased sensitivity*: A larger sample size increases the power of the test, meaning it is more likely to detect a true effect if one exists. With more data we get a clearer picture of the population, reducing the standard error.

*Reduced standard error*: The standard error of the mean decreases as the sample size increases. This makes the test statistic (such as the $t$-score or $z$-score) larger if there is an effect, which can lead to a smaller $p$-value.

*Smaller p-values*: With a large sample size, even small differences from the null hypothesis can result in statistically significant $p$-values, because the test becomes more sensitive to small effects.

*Risk of overinterpretation*: While small $p$-values indicate statistical significance, they do not necessarily imply practical significance. In large datasets, it is crucial to consider the effect size and practical relevance of the results.

Example: Consider a hypothesis test to determine whether a new drug has a different effect than a placebo. With a very large sample size, even a tiny difference in outcomes between the drug and placebo groups may produce a statistically significant $p$-value, even if the difference is not clinically meaningful.

Visual explanation: Imagine plotting the $p$-value as a function of sample size.
Initially, as the sample size increases, the $p$-value decreases sharply, indicating stronger evidence against the null hypothesis. Beyond a certain point, however, the $p$-value becomes very small for even minute differences, highlighting the need to interpret results in context.

A large sample size leads to smaller $p$-values, higher test sensitivity, and the potential for detecting trivial differences as significant. Interpret results carefully: always consider effect size and practical significance, not just statistical significance.

```{r, fig.align='center'}
# Set seed for reproducibility
set.seed(123)

# Generate synthetic data
small_sample <- rnorm(30, mean = 0.03, sd = 1)
large_sample <- rnorm(2000, mean = 0.03, sd = 1)

# Function to calculate p-values for increasing sample sizes
calculate_p_values <- function(data, max_size) {
  p_values <- vector("numeric", max_size)
  for (i in 2:max_size) { # Start from 2 to avoid t-test error
    sample_data <- data[1:i]
    test_result <- t.test(sample_data, mu = 0)
    p_values[i] <- test_result$p.value
  }
  return(p_values)
}

# Calculate p-values for both small and large samples
small_sample_p_values <- calculate_p_values(small_sample, length(small_sample))
large_sample_p_values <- calculate_p_values(large_sample, length(large_sample))

# Combine results into a data frame
p_values_df <- data.frame(
  Sample_Size = c(1:length(small_sample), 1:length(large_sample)),
  P_Value = c(small_sample_p_values, large_sample_p_values),
  Sample_Type = rep(c("Small_Sample", "Large_Sample"),
                    c(length(small_sample), length(large_sample)))
)

# Filter out NA values
p_values_df <- p_values_df %>% filter(!is.na(P_Value))

# Plot the p-values with facets
ggplot(p_values_df, aes(x = Sample_Size, y = P_Value, color = Sample_Type)) +
  geom_line() +
  labs(title = "Influence of Sample Size on P-Value",
       x = "Sample Size",
       y = "P-Value",
       color = "Sample Type") +
  theme_minimal() +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") +
  scale_color_manual(values = c("Small_Sample" = "blue", "Large_Sample" = "green")) +
  facet_wrap(~ Sample_Type, scales = "free_x", ncol = 1) +
  ylim(0, 1)
```

In the graphs above we show how even a very tiny deviation (our data's $\mu = 0.03$) from our null hypothesis ($\mu = 0$) shows up as significant once the sample grows above roughly 1500 data points, while the graph over only a maximum of 30 data points never detects the same difference as significant.

### Student's t-test

So far we have been using the t-test without explaining the *Student's t-distribution* that is used under the hood. Let's explain it now with an example. The health guideline for lead in drinking water is a concentration of not more than 15 parts per billion (ppb). Five independent samples from a reservoir average 15.6 ppb. Is this sufficient evidence to conclude that the concentration $\mu$ in the reservoir is above the standard of 15 ppb?

Our null hypothesis is that the concentration is exactly at the standard, $H_0: \mu = 15$, and the alternative hypothesis is that the concentration is higher than 15 ppb.

$$z= \frac{observed - expected}{SE} = \frac{15.6-15}{SE}$$

The SE of the average is $\frac{\sigma}{\sqrt{n}}$, but $\sigma$ is unknown. By the bootstrap principle we should be able to substitute the standard deviation of the population $\sigma$ with the standard deviation of our sample, $s$; however, for sample sizes of 20 or less the normal curve is not a good enough approximation to the distribution of the $z$-statistic.
Rather, an appropriate approximation is the Student's t-distribution with $n-1$ degrees of freedom.

```{r, fig.align='center', fig.height=4, fig.width=5}
#| echo = FALSE
# Create a sequence of x values
x_values <- seq(-4, 4, by = 0.01)

# Calculate the density of the t-distribution for different degrees of freedom
df1 <- dt(x_values, df = 1)
df2 <- dt(x_values, df = 2)
df5 <- dt(x_values, df = 5)
df_inf <- dnorm(x_values) # Normal distribution as t-distribution with infinite degrees of freedom

# Create a data frame for plotting
data_to_plot <- data.frame(
  x = rep(x_values, 4),
  density = c(df1, df2, df5, df_inf),
  degree_of_freedom = factor(c(rep(1, length(df1)), rep(2, length(df2)),
                               rep(5, length(df5)), rep('∞', length(df_inf))))
)

# Plot using ggplot
ggplot(data_to_plot, aes(x = x, y = density, color = degree_of_freedom)) +
  geom_line() +
  labs(title = "Student's t-Distribution", x = "x", y = "Density") +
  scale_color_discrete(name = "ν")
```

In the graph we can see that the purple line, where the degrees of freedom are high (a large sample), is just the normal curve. The other lines show how, with fewer degrees of freedom, the tails of the curve get fatter, representing higher uncertainty.
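The fatter tails translate directly into larger critical values for small samples. A quick sketch comparing the 97.5th percentile of the t-distribution (used for 95% two-sided intervals) with the normal one:

```{r}
# 97.5% critical values shrink toward the normal 1.96 as the
# degrees of freedom grow
sapply(c(2, 5, 10, 30, 100), function(df) qt(0.975, df))
qnorm(0.975) # roughly 1.96
```

This is why a 95% t-confidence interval from a tiny sample is noticeably wider than the z-based one.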
We saw the **sample standard deviation formula** (@eq-sampleStandarDeviation):

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2}$$

In this case we also need to adjust our **confidence interval** formula from $CI = estimate \pm z\ SE$ to

$$CI = estimate \pm t_{n-1}\ SE$$ {#eq-confidenceIntervalTStudent}

::: {.callout-orange appearance="simple" icon="false"}
::: centered-text
**z-test vs t-test**
:::

The main differences between a t-test and a $z$-test lie in their assumptions and applications:

**T-Test**

- **Distribution**: Uses the Student's t-distribution.
- **Population Variance**: Unknown and estimated from the sample.
- **Sample Size**: Typically used for small sample sizes (n \< 30).
- **Degrees of Freedom**: Required for the calculation.
- **Application**: Used when comparing the means of two groups, especially when the sample size is small and the population variance is unknown.

**Z-Test**

- **Distribution**: Uses the standard normal distribution (z-distribution).
- **Population Variance**: Known.
- **Sample Size**: Typically used for large sample sizes (n \> 30).
- **Degrees of Freedom**: Not required.
- **Application**: Used for hypothesis testing of means and proportions when the sample size is large and the population variance is known.

[**Key Differences**]{.underline}

1. **Distribution**:
   - T-test: Student's t-distribution.
   - $z$-test: Standard normal distribution.
2. **Population Variance**:
   - T-test: Unknown and estimated from the sample.
   - $z$-test: Known.
3. **Sample Size**:
   - T-test: Small sample sizes (n \< 30).
   - $z$-test: Large sample sizes (n \> 30).
4. **Degrees of Freedom**:
   - T-test: Required.
   - $z$-test: Not required.

[**Example Scenarios**]{.underline}

- **T-Test**: Comparing the average test scores of two small classes where the population variance is unknown.
- **Z-Test**: Testing the average height of a large population where the population variance is known.
:::

T-tests are the most commonly used tests in real life, but if we meet all the criteria for a $z$-test we can run one in R like this:

```{r}
library(BSDA)

# Sample data
sample_data <- c(88, 92, 94, 94, 96, 97, 97, 97, 99, 99, 105, 109, 109,
                 109, 110, 112, 112, 113, 114, 115)
population_mean <- 100
population_sd <- 15

# Perform a one-sample z-test
z_test_result <- z.test(sample_data, mu = population_mean, sigma.x = population_sd)
print(z_test_result)
```

# Categorical data

In 1912 the Titanic sank and more than 1500 of the 2229 people on board died. Did the chances of survival depend on the ticket class?

|          | First | Second | Third | Crew |
|----------|-------|--------|-------|------|
| Survived | 202   | 118    | 178   | 215  |
| Died     | 123   | 167    | 528   | 698  |

This is an example of categorical data: the data are counts for a fixed number of categories. Here the data are tabulated in a 2x4 table. Such a table is called a **contingency table** because it shows the survival counts for each category of ticket class, i.e. contingent on ticket class.

## Confidence Intervals for Proportions

Confidence intervals for proportions provide a range of plausible values for population proportions, enabling estimation with a desired level of confidence. Proportional analysis techniques allow for the assessment and comparison of proportions across different categories, providing valuable insights into categorical data.

There are any number of statistical methods for computing confidence intervals for proportions.
By far the simplest is:

$$p = \hat{p} \pm \sqrt{\frac{\hat{p} \cdot (1 - \hat{p})}{n}}$$ {#eq-confidenceIntervalProportions}

where $p$ is the actual population proportion and $\hat{p}$ is the observed proportion.

As long as the sample size isn't too small, the difference between methods should be minor. If your sample includes at least 5 of each sort of possible outcome, you are fine with whatever default method your software uses to calculate the proportions.

In R we will use `prop.test(n_s, sample_size)`, where $n_s$ is the number of successes.

Practice: we are going to calculate the proportion of smokers in our population based on our optical_sample data.

```{r}
file1 <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file1)
as.data.frame(report(optical_sample$IsSmoker))
successNumber <- sum(optical_sample$IsSmoker)
sampleSize <- NROW(optical_sample$IsSmoker)
testResult <- prop.test(successNumber, sampleSize)
testResult
# proportion of successes
proportion <- testResult$estimate %>% print()
conf_int <- testResult$conf.int %>% print()
```

We can use the formula from the beginning of the chapter to calculate the confidence interval manually and compare it with what the software produced:

```{r}
ci_lower <- proportion - sqrt(proportion * (1 - proportion) / sampleSize)
ci_upper <- proportion + sqrt(proportion * (1 - proportion) / sampleSize)
ci_lower
ci_upper
```

Another way of manually calculating the confidence interval is using a $z$-score. In the example below we use 95% (`qnorm(0.975)` is the $z$-score corresponding to the 95% confidence
level):

```{r}
margin_of_error <- qnorm(0.975) * sqrt(proportion * (1 - proportion) / sampleSize)

# Calculate the confidence interval bounds
proportion - margin_of_error
proportion + margin_of_error
```

For different confidence levels you will need to find the corresponding $z$-score.

### Non-binary categorical variables

If our categorical variable is not binary, we can still use this same test to calculate the proportion of one single category against the rest. For example, to calculate the proportion of women in our pm_survey dataset:

```{r}
file2 <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file2)
kable(head(pm_survey)) %>%
  kable_styling(latex_options = "scale_down") %>%
  landscape()
as.data.frame(report(pm_survey$gender))
pm_survey <- pm_survey |>
  mutate(across(.cols = c(highest_level_of_education_completed, gender), as.factor))
numberOfWomen <- NROW(pm_survey[pm_survey$gender == 'Woman', ])
sampleSize <- NROW(pm_survey)
prop.test(numberOfWomen, sampleSize)
```

### Few observations for each outcome

While there are statistical methods for dealing with samples having fewer than 5 of each sort of outcome, you should be conservative about making decisions based on them. That said, two methods might be helpful in such situations:

- The Wilson score interval works well even for observed proportions near zero or one and has other good theoretical properties as well. Unlike the other confidence intervals we have seen, it isn't symmetric.
- The rule of three is a great rough-and-ready way to compute a 95% confidence interval for a proportion when only one kind of outcome has been observed. If you have observed failures only, the interval goes from 0 to 3/n, where n is the sample size; if you have observed successes only, the interval goes from 1 - (3/n) to 1.

## Significance testing for proportions

We want to know if the proportion of people taking medication in our sample is the same as the proportion of people in the USA taking medication.
We know this figure to be 66% of the USA population, so we can plug that value into the test via the $p$ argument, where $p$ is the proportion under our null hypothesis.

```{r}
file <- here::here("data", "optical_sample.xlsx")
optical_sample <- read_excel(file)
as.data.frame(report(optical_sample$TakingMedication))
medicated <- sum(optical_sample$TakingMedication == "yes")
sampleSize <- nrow(optical_sample)
testResult <- prop.test(medicated, sampleSize, p = .66)
testResult$p.value
testResult$conf.int
```

In our case we can reject the null hypothesis and say that it is not reasonable to believe that this sample was drawn from a population in which the true proportion is 66%.

::: exercise-box
Exercises

**Exercise 1.** A media outlet claims that 60% of project managers in the U.S. have a college degree as their highest level of education. Is that claim plausible given our pm_survey data set?

```{r}
file <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file)
pmWithCollegeDegree <- sum(pm_survey$highest_level_of_education_completed == 'College degree')
sampleSize <- nrow(pm_survey)
testResult <- prop.test(pmWithCollegeDegree, sampleSize, p = 0.6)
testResult$p.value
testResult$conf.int
```

In this case we cannot reject the null hypothesis.

**Exercise 2.** Use the same data set to construct a 95% confidence interval for the proportion of project managers that are under the age of 35.

```{r}
pm_under35 <- nrow(pm_survey[pm_survey$how_old_are_you %in% c("18-24", "25-34"), ])
testResult <- prop.test(pm_under35, sampleSize)
testResult$conf.int
testResult$estimate
```

The confidence interval shows that between 44% and 57% of respondents are under 35.
:::

## Goodness of Fit {#sec-goodness-of-fit}

Consider genetic data where you have two groups of genotypes (A or B) for cases and controls for a given disease. The statistical question is whether genotype and disease are associated.
As in the examples we have studied previously, we have two populations (A and B) and numeric data for each, where disease status can be coded as 0 or 1. So why can't we perform a t-test? Note that the data is either 0 (control) or 1 (cases). It is pretty clear that this data is not normally distributed, so the t-distribution approximation is certainly out of the question. We could use the CLT if the sample size were large enough; otherwise, we can use *association tests*.

Imagine we have 250 individuals, where some of them have a given disease and the rest do not. We observe that 20% of the individuals that are homozygous for the minor allele (group B) have the disease compared to 10% of the rest. Would we see this again if we picked another 250 individuals?

```{r}
disease <- factor(c(rep(0, 180), rep(1, 20), rep(0, 40), rep(1, 10)),
                  labels = c("control", "cases"))
genotype <- factor(c(rep("A", 200), rep("B", 50)), levels = c("A", "B"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)), ] # shuffle them up
tab <- table(genotype, disease)
tab
```

The typical statistic we use to summarize these results is the odds ratio (OR). We compute the odds of having the disease if you are a "B": 10/40, the odds of having the disease if you are an "A": 20/180, and take the ratio: $(10/40) / (20/180)$

```{r}
(tab[2, 2] / tab[2, 1]) / (tab[1, 2] / tab[1, 1])
```

To compute a $p$-value, we don't use the `OR` directly.
We instead assume that there is no association between genotype and disease, and then compute what we would expect to see in each cell of the table under the null hypothesis. Ignoring the groups, the probability of having the disease according to our data is:

```{r}
p <- mean(disease == "cases")
p
```

According to this probability, under our null hypothesis we expect to see a table like this:

```{r}
expected <- rbind(c(1 - p, p) * sum(genotype == "A"),
                  c(1 - p, p) * sum(genotype == "B"))
dimnames(expected) <- dimnames(tab)
expected
```

The chi-square test uses an asymptotic result (similar to the CLT) related to sums of independent binary outcomes. Using this approximation, we can compute the probability of seeing a deviation from the expected table as big as the one we saw. The $p$-value for this table is:

```{r}
chisq.test(tab)$p.value
```

So we would expect to find this difference in disease ratio by chance approximately 8.8% of the time, which is not enough to reject our null hypothesis.

**Large Samples, Small $p$-values**

Reporting only $p$-values is not an appropriate way to report the results of your experiment. Studies with large sample sizes will have impressively small $p$-values, yet when one looks closely at the results, the odds ratios can be quite modest: barely bigger than 1. To demonstrate, we recalculate the $p$-value keeping all the proportions identical but increasing the sample size by a factor of 10, which reduces the $p$-value substantially:

```{r}
tab <- tab * 10
chisq.test(tab)$p.value
```

::: {.callout-orange appearance="simple" icon="false"}
1. **Fitting Fallacies**: Goodness of fit doesn't guarantee predictive power. A model can fit past data perfectly yet fail miserably on new, unseen data, highlighting the dangers of overfitting.
2. **The Sample Size Paradox**: Doubling your sample size doesn't necessarily halve the error. In fact, to halve the margin of error you'd need to quadruple the sample, given the square-root relationship between sample size and margin of error.
3.
**Two-Sample Twists**: When comparing two samples, it's possible for each sample's individual data to appear random, yet their combined data can reveal a distinct pattern or significant difference.
:::

A goodness-of-fit test, also called a chi-squared or Pearson goodness-of-fit test, considers whether a categorical variable has a hypothesized distribution.

Warning! **A goodness-of-fit test doesn't give any information about which specific categories might be out of line with the hypothesized distribution.** While you might be able to make an educated guess by looking at the data, this test shouldn't be used to support that kind of intuition.

In 2008 the manufacturer of M&Ms published their last color distribution:

| Blue | Orange | Green | Yellow | Red | Brown |
|------|--------|-------|--------|-----|-------|
| 24%  | 20%    | 16%   | 14%    | 13% | 13%   |

We open several packages of M&Ms and count the colors:

| Blue | Orange | Green | Yellow | Red | Brown |
|------|--------|-------|--------|-----|-------|
| 85   | 79     | 56    | 64     | 58  | 68    |

Are these counts consistent with the last published percentages? Is there sufficient evidence to claim that the color distribution is different? This question requires a test of goodness of fit for the six categories.

Our null hypothesis is that the color distribution is the same. The idea is to compare the observed counts to the numbers we would expect if $H_0$ is true, so for our sample of 410 M&Ms we would expect:

| Blue | Orange | Green | Yellow | Red  | Brown |
|------|--------|-------|--------|------|-------|
| 98.4 | 82     | 65.6  | 57.4   | 53.3 | 53.3  |

We take the difference between the observed and expected values, square that difference, and standardize it by dividing by the expected value:

$$\chi^2 =\sum_{categories} \frac{(observed-expected)^2}{expected}$$

$$\frac{(85 - 98.4)^2}{98.4} + \frac{(79 - 82)^2}{82} + \cdots + \frac{(68 - 53.3)^2}{53.3} = 8.57$$

Large values of the chi-square statistic $\chi^2$ are evidence against the null hypothesis.
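The hand calculation for the M&M counts can be reproduced in one call with `chisq.test()`, passing the observed counts and the published proportions:

```{r}
observed <- c(Blue = 85, Orange = 79, Green = 56, Yellow = 64, Red = 58, Brown = 68)
published <- c(.24, .20, .16, .14, .13, .13)

result <- chisq.test(observed, p = published)
result$statistic # approximately 8.57, matching the sum above
result$p.value   # approximately 0.127
```

The function computes the expected counts (410 times each published proportion) internally, so the statistic agrees with the term-by-term sum worked out above.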
To calculate the $p$-value we use the chi-square distribution. The $p$-value is the right tail of the $\chi^2$ distribution with degrees of freedom = number of categories - 1.

```{r, fig.align='center'}
df <- 5

# Create the chi-square distribution curve
curve(dchisq(x, df = df), from = 0, to = 40,
      main = paste("Chi-Square Distribution (df =", df, ")"),
      ylab = "Density", lwd = 2, col = "steelblue")

# Draw a vertical line at x = 8.57
abline(v = 8.57, col = "red", lwd = 2, lty = 2)

# Highlight the area to the right of x = 8.57
x_vector <- seq(8.57, 40, length.out = 1000)
y_vector <- dchisq(x_vector, df = df)
polygon(c(8.57, x_vector, 40), c(0, y_vector, 0),
        col = adjustcolor("red", alpha.f = 0.3), border = NA)
```

In our example this $p$-value happens to be 12.7%, which does not allow us to reject the null hypothesis. In R we can calculate the $p$-value like this:

```{r}
# Set the chi-square value and degrees of freedom
x <- 8.57
df <- 5

# Calculate the p-value
p_value <- pchisq(x, df, lower.tail = FALSE)

# Print the p-value
p_value
```

::: exercise-box
Exercise:

In the exercise below we want to know whether the age distribution in our project managers survey data matches the distribution of ages in the USA population.
We get the USA distribution from Wikipedia and make some adjustments so the ranges match those in our survey:

```{r}
file1 <- here::here("data", "pm_survey.xlsx")
pm_survey <- read_excel(file1)
us_ages <- c(.117, .176, .168, .158, .166, .215)
table(pm_survey$how_old_are_you)
obs_ages <- as.numeric(table(pm_survey$how_old_are_you))
testResult <- chisq.test(obs_ages, p = us_ages)
testResult %>% report()
testResult$p.value
```

The very small $p$-value indicates that it is extremely unlikely that our pm_survey data was extracted at random from a population with the us_ages distribution.

As mentioned, the goodness-of-fit test does not tell us which categories show the discrepancies. If we want to find out where our sample's age ranges differ from the USA data, we can simply subtract the two sets of proportions and see which categories are underrepresented and which are overrepresented:

```{r}
obs_ages / sum(obs_ages)
obs_ages / sum(obs_ages) - us_ages
```

Now we want to see whether, in our optical sample, the customers are evenly distributed between the opticians. If we don't pass a $p$ argument to the chi-squared test, it assumes we are asking for an even distribution across all categories:

```{r}
file2 <- here::here("data", "optical_full.xlsx")
optical_full <- read_excel(file2)
counts <- as.numeric(table(optical_full$`Optician Last Name`))
testResult <- chisq.test(counts)
testResult %>% report()
testResult$p.value
```

In this case we cannot reject the null hypothesis, which here is that the count of patients is evenly distributed among all opticians.
:::

::: exercise-box
Exercise

Does the distribution of populations of towns in the United States follow [Benford's law](https://en.wikipedia.org/wiki/Benford%27s_law)?
Check using the towns data set.

```{r, fig.align='center'}
theme_set(theme_minimal())
file <- here::here("data", "towns.xlsx")
towns <- read_excel(file)
benford <- c(.301, .176, .125, .097, .079, .067, .058, .051, .046)
sum(benford)

# We calculate the frequency of each first digit in the population of towns
table(towns$first_digit)
digits_freq <- as.numeric(table(towns$first_digit))
digits_freq
testResult <- chisq.test(digits_freq, p = benford)
testResult %>% report()
testResult$p.value

# An illustrative plot
benford_df <- data.frame(
  distribution = c(rep("Towns", 9), rep("Benford", 9)),
  first_digit = rep(1:9, 2),
  frequency = c(digits_freq / sum(digits_freq), benford))
ggplot(benford_df, aes(x = first_digit, y = frequency, fill = distribution)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 1:9) +
  labs(x = "First digit",
       y = "Relative frequency",
       fill = "Distribution") +
  scale_fill_brewer(palette = "Dark2")
```

Our $p$-value indicates that we cannot reject the null hypothesis, meaning that the distribution of the first digit of town populations across the US is compatible with Benford's law.
:::

# Two-Sample testing

Two-sample testing allows for the comparison of two independent groups or populations to assess if there are statistically significant differences between their means, proportions, or other relevant measures.

*Confidence intervals* for two-sample comparison provide a range of plausible values for the difference in means or proportions, allowing for estimation with a desired level of confidence.

*Significance testing* for two-sample comparison involves evaluating the evidence against the null hypothesis and determining if the observed differences between groups are statistically significant.

In this simple exercise we are going to generate ratings for two products at random.
Since we have not changed the probability of each rating, both products should rate the same, and the difference of their mean ratings should approach 0 given enough samples.

Creating many samples at random, calculating their differences, and plotting those differences shows that, although the bell curve centers around 0 as expected, many individual samples show a non-zero difference.

```{r, fig.align='center'}
theme_set(theme_minimal())
set.seed(27)
rating <- sample(1:5, 27, replace = TRUE)
product <- c(rep("A", 15), rep("B", 12))
AB_testing <- data.frame(product, rating)
AB_testing |>
  group_by(product) |>
  dplyr::summarize(avg_rating = mean(rating))
AB_testing |>
  group_by(product) |>
  dplyr::summarize(avg_rating = mean(rating)) |>
  ggplot(aes(x = product, y = avg_rating, fill = product)) +
  geom_col() +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none") +
  labs(x = "Product", y = "Average rating")

# Repeat the samples 10000 times:
difference <- numeric()
for (rep in 1:10000) {
  rating <- sample(1:5, 27, replace = TRUE)
  difference[rep] <- mean(rating[1:15]) - mean(rating[16:27])
}
qplot(difference, binwidth = .2, xlab = "Difference in ratings")
```

In reality we will only have access to one of the samples, and we have to be able to tell whether it is reasonable to draw a conclusion about the population based on that sample, or whether we should simply attribute these sorts of differences to random chance.

## Significance testing for Two-Samples data

### z-test

The significance test for two samples uses the null hypothesis that there is no difference between the two population means. The $p$-value is used to reject, or fail to reject, that null hypothesis.

We can use a $z$-test for the difference between the two means:

$$z = \frac{observed\ difference- expected\ difference}{SE\ of\ difference} = \frac{(\bar{x_2} - \bar{x_1})-0}{SE\ of\ difference}$$

Our expected difference is 0 because that's our null hypothesis (no difference).
An important fact is that if $\bar{x_1}$ and $\bar{x_2}$ are independent, then:

$$ SE(\bar{x_2} - \bar{x_1}) = \sqrt{ (SE(\bar{x_1}))^2 +(SE(\bar{x_2}))^2}$$

::: exercise-box
Exercise: two-sample $z$-test (proportions)

*Last month, the president's approval rating in a sample of 1,000 likely voters was 55%. This month, a poll of 1,500 likely voters resulted in a rating of 58%. Is this sufficient evidence to conclude that the rating has changed?*

$\hat{p_1} = 55\%$ and $\hat{p_2} = 58\%$

$$z = \frac{(\hat{p_2}-\hat{p_1})-0}{SE_{diff}}=\frac{(\hat{p_2}-\hat{p_1})-0}{\sqrt{ (SE(\hat{p_1}))^2 +(SE(\hat{p_2}))^2}}$$

The formula for the standard error of a proportion is

$$ SE= \sqrt{\frac{p(1-p)}{n}}$$

Calculate the standard errors:

For $\hat{p_1}$:

$$SE(\hat{p_1}) = \sqrt{\frac{\hat{p_1}(1 - \hat{p_1})}{n_1}} = \sqrt{\frac{0.55 \times 0.45}{1000}}= \sqrt{\frac{0.2475}{1000}} = \sqrt{0.0002475} \approx 0.0157$$

For $\hat{p_2}$:

$$SE(\hat{p_2}) = \sqrt{\frac{\hat{p_2}(1 - \hat{p_2})}{n_2}} = \sqrt{\frac{0.58 \times 0.42}{1500}} = \sqrt{\frac{0.2436}{1500}} = \sqrt{0.0001624} \approx 0.0127$$

$$SE_{diff} = \sqrt{(0.0157)^2 + (0.0127)^2} = \sqrt{0.00024649 + 0.00016129} = \sqrt{0.00040778} \approx 0.0202$$

$$z = \frac{(0.58 - 0.55) - 0}{0.0202} = \frac{0.03}{0.0202} \approx 1.49$$

The calculated $z$-value is approximately 1.49. To determine if this is statistically significant, you would compare this $z$-value to the critical value from the standard normal distribution (typically 1.96 for a 95% confidence level).
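The arithmetic above can be reproduced in R. This is a quick sketch using the numbers from the exercise; computing directly, without rounding the intermediate standard errors, gives $z \approx 1.48$, matching the hand calculation up to rounding:

```r
p1 <- 0.55; n1 <- 1000   # last month's poll
p2 <- 0.58; n2 <- 1500   # this month's poll

# Standard error of each sample proportion
se1 <- sqrt(p1 * (1 - p1) / n1)
se2 <- sqrt(p2 * (1 - p2) / n2)

# Standard error of the difference, and the z statistic
se_diff <- sqrt(se1^2 + se2^2)
z <- (p2 - p1) / se_diff
z
```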
Since 1.49 is less than 1.96, you would fail to reject the null hypothesis at the 95% confidence level, indicating that there is not sufficient evidence to conclude that the president's approval rating has changed significantly.

The $p$-value can be calculated using standard normal distribution tables or software:

```{r}
pValue <- pnorm(1.49)
pValue
```

Since this is a two-tailed test (we are checking if the approval rating has changed, not just increased or decreased), we need to consider both tails of the distribution, so we double our values. The $p$-value is calculated as $2 \times (1 - \text{cumulative probability})$:

$$p = 2 \times (1 - 0.9318) = 2 \times 0.0682 = 0.1364$$

The $p$-value for a $z$-value of 1.49 is approximately 0.1364. Since this $p$-value is greater than the typical significance level of 0.05, we fail to reject the null hypothesis. This means there is not sufficient evidence to conclude that the president's approval rating has changed significantly.
:::

The exercise above shows how to calculate the $p$-value for proportions. If we are working with means instead of proportions, the calculation is the same; the only difference is that the standard error of each individual sample is calculated using the formula for the standard error of the mean, $SE(\bar{x_1})= \frac{\sigma_1}{\sqrt{n_1}} \approx \frac{s_1}{\sqrt{n_1}}$. If the sample sizes are not large, then the $p$-value needs to be computed from the t-distribution instead.

### The Welch Two Sample t-test

If we have two groups, one control and one treatment, with means $\mu_c$ and $\mu_t$, the two-sample t-statistic is defined as:

$$T = \frac{\hat\mu_t-\hat\mu_c}{s\sqrt{\frac{1}{n_t}+\frac{1}{n_c}}}$$ {#eq-two-sampletstatistic}

where

$$s=\sqrt{\frac{(n_t-1)s^2_t+(n_c-1)s^2_c}{n_t+n_c-2}}$$

is an estimator of the pooled standard deviation of the two samples.
Here, $s^2_t$ and $s^2_c$ are unbiased estimators of the variance of our metric in the treatment and control group, respectively. A large (absolute) value of $T$ provides evidence against $H_0$.

In the example below we use the attrition dataset to see whether the average age of employees differs between two departments:

```{r}
file <- here::here("data", "attrition1.xlsx")
attrition1 <- read_excel(file)
kable(head(attrition1)) %>%
  kable_styling(latex_options = "scale_down") %>%
  landscape()
attrition1 |>
  group_by(Department) |>
  dplyr::summarize(avg_age = mean(Age))
testResult <- t.test(Age ~ Department, data = attrition1)
testResult$p.value
report(testResult)
```

## Confidence intervals for Two-Samples data

We can also use the formula for the standard error of the difference in a confidence interval calculation. The confidence interval for $p_2-p_1$ is

$$(\hat{p_2}-\hat{p_1}) \pm z \times SE(\hat{p_2}-\hat{p_1})$$

where $z$ is the $z$-score for the confidence level we are interested in.

::: exercise-box
In our example about voter approval of the president, the $z$-score for a 95% confidence level is 1.959964:

$$(0.58 - 0.55) \pm 1.96 \times 0.0202 \approx [-0.0096, 0.0696]$$

If we solve the same exercise with R code it gives similar, but not exactly the same, results (note that `prop.test` reports the difference as prop1 $-$ prop2, so the signs are flipped):

```{r}
successes <- c(0.55 * 1000, 0.58 * 1500)

# Define the sample sizes
sample_sizes <- c(1000, 1500)

# Perform the two-sample z-test for proportions
test_result <- prop.test(successes, sample_sizes)

# Print the result
print(test_result)
```
:::

There are many statistical techniques for describing the difference between a single variable across two populations. The most universal is the **Welch two-sample confidence interval**, a statistical technique used to compare means between two independent groups that takes into account the unequal variances often encountered in real-world data, so it is valid in nearly all circumstances.
A few things to bear in mind:

- It is a multi-sample procedure, not a multi-variable one. Use it when you're asking how much two samples differ in a single variable.
- Like every other statistical tool in this course, it requires that all the observations be independent of one another (so it should not be used with time-series data).
- Unless the data has extreme outliers or is highly asymmetric, it will give good results when both samples are of size 10 or more. If the samples are of size at least 30, it is fine under all but the most extreme circumstances.

The Welch two-sample confidence interval does not require that the samples are equal in size or that the populations have equal variances, unlike some other procedures.

We are going to use the AB_testing data we generated in the previous section and calculate the confidence interval for the difference observed in one of our samples:

```{r}
set.seed(27)
rating <- sample(1:5, 27, replace = TRUE)
product <- c(rep("A", 15), rep("B", 12))
AB_testing <- data.frame(product, rating)
AB_testing |>
  group_by(product) %>%
  dplyr::summarize(avg_rating = mean(rating))
as.data.frame(report(AB_testing))
testResult <- t.test(rating ~ product, data = AB_testing)
testResult %>% report()
testResult$conf.int
```

The confidence interval says that with 95% confidence, the population mean difference is between `r testResult$conf.int[1]` and `r testResult$conf.int[2]`.
Since this interval contains 0, the difference in the population means could be 0.

If you know, or can assume, that the variance of the two populations is equal, then you can set `var.equal = TRUE` in `t.test`; in that case, instead of a Welch two-sample t-test, R performs a plain two-sample t-test:

```{r}
testResult <- t.test(rating ~ product, data = AB_testing, var.equal = TRUE)
testResult %>% report()
testResult$conf.int
```

In the example below we use our substance_abuse data set to see if there is a difference in the variable DLA_improvement based on the program that the patient was following. We are assuming that the variance is equal in both cases. We can also change the confidence level manually if we want:

```{r}
file <- here::here("data", "substance_abuse.xlsx")
substance_abuse <- read_excel(file)
substance_abuse$DLA_improvement <- substance_abuse$DLA2 - substance_abuse$DLA1
t.test(DLA_improvement ~ Program, data = substance_abuse,
       var.equal = TRUE, conf.level = .99)
```

::: exercise-box
Exercises

**Exercise 1**: *Generate a 95% confidence interval for the difference in average monthly income between the research and development and sales departments of the company.*

```{r}
testResult <- t.test(MonthlyIncome ~ Department, data = attrition1)
testResult$conf.int
report(testResult)
```

The confidence interval does not include 0, which indicates a difference between the population means of the two groups.

**Exercise 2**: *Is it reasonable to claim that the monthly rate is the same between these two departments?*

```{r}
testResult <- t.test(MonthlyRate ~ Department, data = attrition1, mu = 0)
testResult$p.value
report(testResult)
```

In light of the results we cannot reject the null hypothesis that they have the same monthly rate.
:::

### Pooled estimate:

Going back to our exercise on voter approval ratings, we concluded that the two surveys did not differ significantly. Since we could not reject the null hypothesis that the two samples represent
the same approval for the candidate, we can combine them to find a better estimate of our standard error.

In the first sample we have $0.55 \times 1000 = 550$ voters who approved; in the second we have $0.58 \times 1500 = 870$. In total we have 1420 approvals out of 2500 people surveyed, so our pooled estimate is $\frac{1420}{2500}=56.8\%$. Now we can calculate the standard error using that new value:

$$ SE(\hat{p_2}-\hat{p_1}) = \sqrt{\frac{0.568(1-0.568)}{1000}+\frac{0.568(1-0.568)}{1500}}=0.02022 $$

### Pooled standard deviation:

If we know (or there is reason to assume) that the standard deviation of the two populations is the same, $\sigma_1=\sigma_2$, then we can use the pooled estimate:

$$ s^2_{pooled}=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2} $$

::: {.callout-orange appearance="simple" icon="false"}
**Two-Sample $z$-Test vs Two-Sample T-Test vs Welch's Two-Sample T-Test**

**Two-sample $z$-test**

When to Use:

- **Known Population Variances**: The population variances are known.
- **Large Sample Sizes**: Typically used when the sample sizes are large (n \> 30).
- **Normal Distribution**: Assumes that the data follows a normal distribution.

**Two-sample T-test**

When to Use:

- **Unknown Population Variances**: The population variances are unknown and assumed to be equal.
- **Small Sample Sizes**: Typically used when the sample sizes are small (n \< 30).
- **Normal Distribution**: Assumes that the data follows a normal distribution.

**Welch's Two-Sample T-Test**

When to Use:

- **Unknown and Unequal Population Variances**: The population variances are unknown and not assumed to be equal.
- **Small or Unequal Sample Sizes**: Can be used for small or unequal sample sizes.
- **Normal Distribution**: Assumes that the data follows a normal distribution.

**Summary:**

- **Population Variances**:
  - **Z-Test**: Assumes known and equal population variances.
  - **Welch's T-Test**: Does not assume equal variances and uses sample variances.
- **Sample Size**:
  - **Z-Test**: Typically used for large sample sizes.
  - **Welch's T-Test**: Can be used for small or unequal sample sizes.
:::

## Paired t-test

What do we do when we have two samples that are not independent of each other? In this case we cannot use the classical two-sample $z$-test or two-sample t-test.

We want to answer the question: do husbands tend to be older than their wives?

| Husband's age | Wife's age | Age difference |
|---------------|------------|----------------|
| 43 | 41 | 2 |
| 71 | 70 | 1 |
| 32 | 31 | 1 |
| 68 | 66 | 2 |
| 27 | 26 | 1 |

In a scenario like this, even if the samples were independent, a two-sample test would not give us a significant difference: the difference within each pair is always small, while the spread of values within each sample (the standard deviation) is large, and what the two-sample test does is compare the difference in means to the fluctuation within each population.

In this case we will use the paired t-test. Our $H_0$ is that the population difference is 0. The formula for this test is:

$$t=\frac{\bar{d}-0}{SE_{(d)}}$$

where $\bar{d}$ is the average of the differences, 0 is the expected difference under our null hypothesis, and $SE_{(d)}$ is the standard error of the differences:

$$SE_{(d)}= \frac{\sigma_d}{\sqrt{n}} \approx \frac{s_d}{\sqrt{n}}$$

In the example with our data:

$$t=\frac{1.4}{\frac{0.55}{\sqrt{5}}}=5.69$$

Now we use a Student's t-distribution table with 4 degrees of freedom to find the area under the curve to the right of our t value. We can use R code to calculate it like this:

```{r}
# Given values
t_value <- 5.69
degreesfreedom <- 4

# Calculate the p-value for a two-tailed test
p_value <- 2 * pt(-abs(t_value), degreesfreedom)
p_value
```

In this case our result means that we can reject the null hypothesis.

## The sign test

Imagine that we don't know the age differences; we only know whether each husband is older or not. We can follow the same approach here as with a binomial distribution, like a coin toss.
In this specific case our null hypothesis $H_0$ is that half of the husbands are older than their wives (no difference). We assign labels to the results, for example 1 if the husband is older than the wife and 0 otherwise.

$$z=\frac{sum\ of\ 1s -\frac{n}{2}}{SE\ of\ sum} =\frac{sum\ of\ 1s -\frac{n}{2}}{\sqrt{n}\times\sigma_{H_0}}$$

In our case we have 5 husbands older than their wives, so five 1s, and the standard deviation under our null hypothesis is $\frac{1}{2}$ because we expect half the husbands to be older than their wives.

$$z= \frac{5-\frac{5}{2}}{\sqrt{5}\times\frac{1}{2}}= 2.24$$

Now we can find the $p$-value using a table or software:

```{r}
z_score <- 2.24

# Calculate the p-value for a two-tailed test
p_value <- 2 * (1 - pnorm(z_score))
p_value
```

We can see that the result of this test is less significant than the test we did before; this is because we have less information to work with.

If we want to solve the problem using software we can do the calculations:

```{r}
# Number of successes (husbands older than wives)
successes <- 5

# Total number of pairs
n <- 5

# Z-test (sign test approximation)
z_score <- (successes - 0.5 * n) / sqrt(0.25 * n)
z_p_value <- 2 * (1 - pnorm(z_score))
z_p_value
```

But if we use the exact binomial test that applies in this scenario, we get quite a different result:

```{r}
# Number of successes (husbands older than wives)
successes <- 5

# Total number of pairs
n <- 5

# Perform the binomial test
test_result <- binom.test(successes, n, p = 0.5, alternative = "two.sided")
print(test_result)
```

::: {.callout-orange appearance="simple" icon="false"}
Both the two-sample paired t-test and the sign test are used to compare paired data, but they are applied in different situations based on the assumptions and characteristics of the data.

**Two-Sample Paired T-Test**

**When to Use**:

- **Normal Distribution**: The differences between the paired observations should be approximately normally distributed.
- **Interval or Ratio Data**: The data should be
measured on an interval or ratio scale.
- **Parametric Test**: This test is parametric, meaning it relies on assumptions about the distribution of the data.

**Example**:

- Comparing the blood pressure of patients before and after a treatment.
- Measuring the weight of individuals before and after a diet program.

**Sign test**

1. **Paired Observations**: When you have paired data (e.g., before and after measurements) and you want to test if there is a consistent difference between the pairs. For example, testing if a treatment has an effect by comparing measurements before and after the treatment.
2. **Median Differences**: When you want to test if the median of differences between pairs is zero. This is useful when the data does not meet the assumptions required for parametric tests like the paired t-test. It does not rely on assumptions about the distribution of the data.
3. **Non-Normal Data**: When the data does not follow a normal distribution, making parametric tests inappropriate. The sign test does not assume any specific distribution for the data.
4. **Ordinal Data**: When the data is ordinal (ranked) rather than interval or ratio. The sign test can handle data that can only be compared as greater than, less than, or equal to.

[**Example Scenarios**]{.underline}

- **Medical Studies**: Comparing the effectiveness of a treatment by measuring patient conditions before and after the treatment.
- **Quality Control**: Testing if a new manufacturing process consistently produces better results than the old process.
- **Behavioral Studies**: Comparing responses before and after an intervention. Comparing the number of days patients feel better before and after a new medication.

[**Summary:**]{.underline}

- **Use a paired t-test** when the differences between paired observations are normally distributed and you have interval or ratio data.
- **Use a sign test** when the data does not meet the normality assumption, is ordinal, or you prefer a non-parametric approach.
:::

## Wilcoxon Rank Sum Test

We learned how the sample mean and SD are susceptible to outliers. The t-test is based on these measures and is susceptible as well. The Wilcoxon rank test (equivalent to the Mann-Whitney test) provides an alternative. In the code below, we perform a t-test on data for which the null is true. However, two observations in one sample are entered incorrectly by mistake. Here we see that the t-test results in a small $p$-value, while the Wilcoxon test does not:

```{r}
set.seed(779) ## 779 picked for illustration purposes
N <- 25
x <- rnorm(N, 0, 1)
y <- rnorm(N, 0, 1)
```

Create outliers:

```{r}
x[1] <- 5
x[2] <- 7
cat("t-test pval:", t.test(x, y)$p.value)
cat("Wilcox test pval:", wilcox.test(x, y)$p.value)
```

The basic idea is to 1) combine all the data, 2) turn the values into ranks, 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test.

```{r rank-test-illustration, fig.cap="Data from two populations with two outliers. The left plot shows the original data and the right plot shows their ranks. The numbers are the w values.", fig.width=10.5, fig.height=5.25}
library(rafalib)
mypar(1, 2)
stripchart(list(x, y), vertical = TRUE, ylim = c(-7, 7),
           ylab = "Observations", pch = 21, bg = 1)
abline(h = 0)
xrank <- rank(c(x, y))[seq(along = x)]
yrank <- rank(c(x, y))[-seq(along = x)]
stripchart(list(xrank, yrank), vertical = TRUE, ylab = "Ranks",
           pch = 21, bg = 1, cex = 1.25)
ws <- sapply(x, function(z) rank(c(z, y))[1] - 1)
text(rep(1.05, length(ws)), xrank, ws, cex = 0.8)
W <- sum(ws)
```

`W` is the sum of the ranks for the first group relative to the second group. We can compute an exact $p$-value for $W$ based on combinatorics. We can also use the CLT, since statistical theory tells us that this `W` is approximated by the normal distribution. We can construct a $z$-score as follows:

```{r}
n1 <- length(x); n2 <- length(y)
Z <- (mean(ws) - n2 / 2) / sqrt(n2 * (n1 + n2 + 1) / 12 / n1)
print(Z)
```

Here the `Z` is not large enough to give us a $p$-value less than 0.05.
These are part of the calculations performed by the R function `wilcox.test`. We are not going to go into the mathematical details of these calculations, but the formula for the $z$-score here is

$$ z=\frac{U - \frac{n_2}{2}}{\sqrt{\frac{n_2(n_1+n_2+1)}{12n_1}}}$$

and to perform this test in R we use `wilcox.test(x, y)`.

## Two-samples of binary data

Two observed proportions for a single binary variable can be compared directly using the **two-proportion** $z$-confidence interval and **two-proportion** $z$-test. These apply when there are at least 5 observations of each type in each of the two groups. The formulas are relatively simple. For example, a 95% confidence interval for the difference between proportions is

$$(p_2-p_1) = (\hat p_2-\hat p_1) \pm 1.96 \sqrt{\frac{\hat p_2(1-\hat p_2)}{n_2}+\frac{\hat p_1(1-\hat p_1)}{n_1}} $$

where $p_1$ and $p_2$ are the population proportions, $\hat p_1$ and $\hat p_2$ are the observed sample proportions, and $n_1$ and $n_2$ are the sample sizes. R will use an improved version of this formula when computing the proportions.

In R we will use `prop.test(n_s, sample_size)` where $n_s$ is the number of successes.

::: exercise-box
Examples:

In the attrition dataset: *Are the attrition proportions different for the two departments?*

```{r}
kable(head(attrition1)) %>%
  kable_styling(latex_options = "scale_down") %>%
  landscape()
table(attrition1$Department)
t <- table(attrition1$Department, attrition1$Attrition)
t
yes_counts <- as.numeric(t[, 2])
sampleSize <- as.numeric(table(attrition1$Department))
testResult <- prop.test(yes_counts, sampleSize)
testResult
```

Our $p$-value `r testResult$p.value` is very low, which means we can reject the default null hypothesis (that the difference between the two proportions is 0). The output also gives us the sample proportions for the two departments: `r testResult$estimate`. And the confidence interval is for the difference between the two proportions (as in prop1 - prop2).
In this case, being negative means that the second department has a higher attrition rate. The order of the proportions matches the order of the input vectors, so if *yes_counts* had *Research_Development* first and *Sales* second, prop1 will be for *Research_Development* and prop2 for *Sales*.
:::

## Two-samples of categorical variables {#sec-chiSquare}

Two samples of a single categorical variable can be compared with the $\chi^2$-test for homogeneity (chi-squared). Under the hood, this is just a goodness-of-fit test of the hypothesis that the observed proportions in one of the samples are the same as those in the pooled sample.

### Testing homogeneity

The $\chi^2$ test of homogeneity tests the null hypothesis that the distribution of a categorical variable is the same for several populations. It assumes that the samples are drawn independently within and across populations.

See how we can apply this logic to our Titanic survival data:

|          | First | Second | Third | Crew |
|----------|-------|--------|-------|------|
| Survived | 202 | 118 | 178 | 215 |
| Died | 123 | 167 | 528 | 698 |

Note that in this case we are not sampling from a population. The data are not a random sample of the people on board; rather, the data represent the whole population. So in this case the chance process resulting in survival or death is not the sampling, but the result of random events occurring while looking for a way off the ship, like getting into a lifeboat or into the water, being rescued from the water in time, etc. Then the 325 observations of first-class passengers represent 325 independent draws from a probability histogram that gives a certain chance of survival. The 285 observations about second-class passengers are drawn from the probability histogram for second-class passengers, which may be different. The null hypothesis says that the probability of survival is the same for all four probability histograms.
According to this hypothesis, we can calculate the probability of survival by pooling all the data: $\frac{713}{2229}=32\%$. With this number we can calculate the expected number of surviving passengers for each class:

| Surviving | First | Second | Third | Crew |
|-----------|-------|--------|-------|-------|
| Observed | 202 | 118 | 178 | 215 |
| Expected | 104.0 | 91.2 | 225.8 | 292.1 |

| Died | First | Second | Third | Crew |
|----------|-------|--------|-------|-------|
| Observed | 123 | 167 | 528 | 698 |
| Expected | 221.0 | 193.8 | 480.1 | 620.8 |

Now we can compute our *chi statistic* as we learned, using all the differences between expected and observed values:

$$\chi^2 =\sum_{categories} \frac{(observed-expected)^2}{expected}$$

$$=\frac{(202-104)^2}{104}+\frac{(123-221)^2}{221}+\cdots =192 $$

In this case the degrees of freedom are calculated like this: $(4-1)\times(2-1) = 3$, where 4 is the number of categories (classes) and 2 is the number of rows of results we are dealing with (survived and died).

In our case the $p$-value will be extremely small, suggesting we should reject the null hypothesis that all ticket classes had the same probability of survival.

```{r}
x <- 192.2
df <- 3

# Calculate the p-value
p_value <- pchisq(x, df, lower.tail = FALSE)

# Print the p-value
p_value
```

The test in R:

```{r}
# Create the contingency table
titanic_data <- matrix(c(202, 118, 178, 215,
                         123, 167, 528, 698),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Survival = c("Survived", "Died"),
                                       Class = c("First", "Second", "Third", "Crew")))

# Print the contingency table
print(titanic_data)

# Perform the chi-squared test of homogeneity
chi_squared_test <- chisq.test(titanic_data)

# Print the test results
print(chi_squared_test)
```

::: exercise-box
Exercise:

*Use the substance_abuse data set to decide if the distribution of mental health diagnoses is the same for those with and without at least one psychiatric admission.*

```{r}
as.data.frame(report(substance_abuse$MHDx))
as.data.frame(report(substance_abuse$PsychAdmit))
substance_abuse <-
  substance_abuse %>%
  mutate(previousAdmit = ifelse(substance_abuse$PsychAdmit == 0, FALSE, TRUE))
t <- table(substance_abuse$previousAdmit, substance_abuse$MHDx)
t
chisq.test(t)
```

In this case, according to our $p$-value, we cannot say that patients who had at least one psychiatric admission in the past have the same distribution of diagnoses as patients without any admission.
:::

The chi-squared test for homogeneity does not say anything specific about any of these diagnoses; it simply says that the distribution of diagnoses between those two groups (admitted vs not admitted) is not the same.

Is the distribution of SUDx the same for both men and women in our Substance Abuse data?

```{r}
head(substance_abuse)
as.data.frame(report(substance_abuse$SUDx))
```

```{r}
t <- table(substance_abuse$Gender, substance_abuse$SUDx)
head(t)
```

Our null hypothesis is that the distribution is the same for both genders.

```{r}
result <- chisq.test(t)
report(result)
```

In this case the $p$-value of the test is over 0.05, which indicates that there is no significant difference by sex for the SUDx variable, so men and women are abusing the different categories of substances in the same proportions. The df in our results are the degrees of freedom; with two genders this is the number of substance categories minus one.

In reality this is the same as doing a goodness-of-fit test as we saw before; recall it was `chisq.test(n_s, p)`, where $n_s$ was the counts of the different categories in our sample and $p$ the proportions of the population we wanted to test against.
Back to our example: what we are measuring here is the proportions for women (or men) against the proportions for the full sample. That will tell us if there is a difference between men and women, so we can also write the test like this, getting similar results:

```{r}
sudxProp <- table(substance_abuse$SUDx) / sum(table(substance_abuse$SUDx))
women <- t[1, ]
result <- chisq.test(women, p = sudxProp)
report(result)
```

### Independence Testing of Categorical Variables

We want to answer this question: is gender (M/F) related to voting preference (liberal/conservative)? Now we have two categorical variables: gender and voting preference. The null hypothesis is that the two variables are independent. The alternative hypothesis is that there is some kind of association.

The tool we are going to use is the chi-square test ($\chi^2$) for independence. This test can be used to test whether two categorical variables are independent or not, that is, whether knowledge about one provides information about the other. The math is exactly the same as for the chi-square test for homogeneity, and hence so is the syntax in most statistical packages, including R, but there are a few things to bear in mind:

- Technically, the null hypothesis of a $\chi^2$ test for independence is that in the larger population, the probability of an observation being in any specific pair of categories is equal to the product of the probabilities of being in each of the individual categories.
- This is another omnibus test: it says nothing about particular categories.
- Larger cell counts are better, as usual.
  Do not do this test if any of the counts are less than 5.

We are going to use our product_ratings dataset to see the relationship between age groups and the product the user purchased. First we can look at the contingency table for those two variables:

```{r}
file <- here::here("data", "product_ratings.xlsx")
product_ratings <- read_excel(file)
t <- table(product_ratings$age, product_ratings$product)
t
```

```{r}
chisq.test(t)
```

The low $p$-value from this test indicates that the theory that these two categorical variables are independent is implausible.

::: {.callout-orange appearance="simple" icon="false"}
Chi-squared is used for testing the independence of categorical variables. It compares observed frequencies in a contingency table to expected frequencies under independence.
:::

::: exercise-box
Exercises

*Referring to the substance_abuse data set:*

- *Is there any evidence of an association between Substance Use Diagnosis and Mental Health Diagnosis among patients receiving an intervention?*

```{r}
treatment <- filter(substance_abuse, Program == "Intervention")
t <- table(treatment$SUDx, treatment$MHDx)
chisq.test(t)
```

According to this result we cannot conclude that there is a relationship between these two variables.

- *Is there any evidence of an association between SUDx and DLA_improvement among patients receiving an intervention?*

```{r, fig.align="center"}
ggplot(treatment, aes(x = SUDx, y = DLA_improvement)) +
  geom_boxplot()
testResult <- aov(DLA_improvement ~ SUDx, data = treatment)
summary(testResult)
```

The difference is significant, but not by much: the $p$-value shows that in about 3% of samples drawn from a population where the category means are equal we would observe differences like these.
:::

If we run a TukeyHSD test on these results we can compare each pair individually, and we find that the only pair showing a statistically significant difference is alcohol with opioids.

```{r}
TukeyHSD(testResult)
```

It is easy to confuse the testing for homogeneity and
the testing for independence.

### Comparing the different uses of the chi-square test

+-----------------+----------------------------------------------------------+----------------------------------------------------------------------------------------+
|                 | sample                                                   | research question                                                                      |
+=================+==========================================================+========================================================================================+
| goodness of fit | single categorical variable, one sample                  | Are the counts of the different categories matching our expected results?              |
+-----------------+----------------------------------------------------------+----------------------------------------------------------------------------------------+
| homogeneity     | single categorical variable measured on several samples | Are the groups homogeneous (have the same distribution of the categorical variable)?   |
+-----------------+----------------------------------------------------------+----------------------------------------------------------------------------------------+
| independence    | two categorical variables measured on one sample         | Are the two categorical variables independent?                                         |
+-----------------+----------------------------------------------------------+----------------------------------------------------------------------------------------+

Examples:

- We want to know if there are more births on some weekdays (Monday, Tuesday, ...) than others. We record the weekday of 300 births. What test should we use to know if there is a difference per day? --\> We would use the chi-square test for goodness of fit.
- A food delivery start-up decides to advertise its service by placing ads on web pages. They wonder whether the percentage of viewers who click on the ad changes depending on how often the viewers were shown the ad.
They randomly select 100 viewers from among those who were shown the ad once, 135 from among those who were shown the ad twice, and 150 from among those who were shown the ad three times. --\> We would use the chi-square test of homogeneity.
- An airline wants to find out whether there is a connection between the customer's status in its frequent flyer program and the class of ticket that the customer buys. It samples 1,000 ticket records at random and for each ticket notes the status level ('none', 'silver', 'gold') and the ticket class ('economy', 'business', 'first'). --\> chi-square test of independence
- A county wants to check whether the racial composition of the teachers in the county corresponds to that of the population in the county. It samples 500 teachers at random and wants to compare that sample with the census numbers about the racial groups in that county. --\> chi-square test for goodness of fit

# Permutation Tests

Suppose we have a situation in which none of the standard mathematical statistical approximations apply. We have computed a summary statistic, such as a difference in means, but do not have a useful approximation, such as that provided by the CLT. In practice, we do not have access to all values in the population, so we can't perform a simulation. Permutation tests can be useful in these scenarios.

We are back to the scenario where we only have 12 measurements for each group.

```{r, fig.align='center'}
dat <- read.csv("data/femaleMiceWeights.csv")
control <- filter(dat, Diet == "chow") %>%
  dplyr::select(Bodyweight) %>%
  unlist
treatment <- filter(dat, Diet == "hf") %>%
  dplyr::select(Bodyweight) %>%
  unlist
obsdiff <- mean(treatment) - mean(control)
obsdiff
```

In previous sections, we showed parametric approaches that helped determine if the observed difference was significant. Permutation tests take advantage of the fact that if we randomly shuffle the cases and control labels, then the null is true.
So we shuffle the cases and control labels and assume that the ensuing distribution approximates the null distribution. Here is how we generate a null distribution by shuffling the data 1,000 times:

```{r, fig.align='center'}
N <- 12
avgdiff <- replicate(1000, {
  all <- sample(c(control, treatment))
  newcontrols <- all[1:N]
  newtreatments <- all[(N + 1):(2 * N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v = obsdiff, col = "red", lwd = 2)
```

How many of the null means are bigger than the observed value? That proportion would be the $p$-value for the null. We add a 1 to the numerator and denominator to account for misestimation of the $p$-value. The $p$-value here is calculated directly as the proportion of permuted differences larger in absolute value than our observed difference.

```{r}
# the proportion of permutations with a larger difference
(sum(abs(avgdiff) > abs(obsdiff)) + 1) / (length(avgdiff) + 1)
```

Now let's repeat this experiment for a smaller dataset. We create a smaller dataset by sampling:

```{r, fig.align='center'}
N <- 5
control <- sample(control, N)
treatment <- sample(treatment, N)
obsdiff <- mean(treatment) - mean(control)
obsdiff
avgdiff <- replicate(1000, {
  all <- sample(c(control, treatment))
  newcontrols <- all[1:N]
  newtreatments <- all[(N + 1):(2 * N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v = obsdiff, col = "red", lwd = 2)
```

Now the observed difference is not significant using this approach. Keep in mind that there is no theoretical guarantee that the null distribution estimated from permutations approximates the actual null distribution. For example, if there is a real difference between the populations, some of the permutations will be unbalanced and will contain some samples that explain this difference. This implies that the null distribution created with permutations will have larger tails than the actual null distribution. This is why permutations result in conservative $p$-values.
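There is also a hard combinatorial limit: with two groups of size $N$ there are only $\binom{2N}{N}$ distinct ways to relabel the observations, which caps how extreme a permutation $p$-value can get. A quick count (the group sizes mirror the examples above):

```{r}
# Number of distinct relabelings of 2N observations into two groups of N
choose(10, 5)   # N = 5 per group: only 252 distinct splits
choose(24, 12)  # N = 12 per group: about 2.7 million splits
```

With only 252 distinct splits, the smallest attainable $p$-value is bounded well away from zero, so very small samples make the test uninformative.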
For this reason, when we have few samples, we can't do permutations.

Note also that permutation tests still have assumptions: samples are assumed to be independent and "exchangeable". If there is hidden structure in your data, then permutation tests can result in estimated null distributions that underestimate the size of the tails, because the permutations may destroy the existing structure in the original data.

# Multiple test corrections

The **Bonferroni correction** is a statistical method used to address the problem of multiple comparisons. When you perform multiple hypothesis tests, the chance of making a Type I error (false positive) increases. The Bonferroni correction helps control this by adjusting the significance level.

What we do is, if there are $m$ tests, we multiply the $p$-values by $m$. This correction makes sure that P(any of the $m$ tests rejects in error) is still smaller than 5%. The Bonferroni correction is often very restrictive: it guards against having even one false positive among the $m$ tests. As a consequence, the adjusted $p$-values may not be significant any more even if a noticeable effect is present, so it only works well if you don't have a large number of tests.

Alternatively, if the number of tests is large, it is better to use the **False Discovery Proportion (FDP)**:

$$FDP = \frac{\text{number of false discoveries}}{\text{total number of discoveries}}$$

where a discovery occurs when a test rejects the null hypothesis. If no hypotheses are rejected, the FDP is defined to be 0.

The **False Discovery Rate (FDR)** is the expected proportion of false discoveries among the discoveries. One common method to control the FDR is the Benjamini-Hochberg procedure. This method adjusts the $p$-values to control the FDR at a desired alpha level. The procedure works like this:

1. Sort the $p$-values in ascending order: $p_{(1)}, p_{(2)}, \ldots, p_{(m)}$
2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m}\alpha$
3. Reject the null hypothesis for all $p_{(i)}$ where $i \leq k$

In R:

```{r}
# Example p-values
p_values <- c(0.01, 0.04, 0.03, 0.002, 0.05, 0.20)
# Apply the Benjamini-Hochberg procedure
p.adjusted <- p.adjust(p_values, method = "BH")
print(p.adjusted)
```

This will give you the adjusted $p$-values, and you can compare them to your desired FDR level to decide which hypotheses to reject.

The **validation set approach**. With this approach you split your data into two parts, a model-building set and a validation set, before the analysis. You may use data snooping (multiple testing) on the first set of data to find something interesting, and then you test this hypothesis on the validation set.

# Choosing the right test for your data

Source: [Bevans, R. (2023, June 22). Choosing the Right Statistical Test Types & Examples. Scribbr. Retrieved November 18, 2024](https://www.scribbr.com/statistics/statistical-tests/)

### **Statistical assumptions**

Statistical tests make some common assumptions about the data they are testing:

1. **Independence of observations** (a.k.a. no autocorrelation): The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).
2. **Homogeneity of variance**: the variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test's effectiveness.
3. **Normality of data**: the data follows a normal distribution (a.k.a. a bell curve).
This assumption applies only to quantitative data.

If your data do not meet the assumptions of normality or homogeneity of variance, you may be able to perform a **nonparametric statistical test**, which allows you to make comparisons without any assumptions about the data distribution.

If your data do not meet the assumption of independence of observations, you may be able to use a test that accounts for structure in your data (repeated-measures tests or tests that include blocking variables).

### **Types of variables**

The types of variables you have usually determine what type of statistical test you can use.

**Quantitative variables** represent amounts of things (e.g. the number of trees in a forest). Types of quantitative variables include:

- **Continuous** (a.k.a. ratio variables): represent measures and can usually be divided into units smaller than one (e.g. 0.75 grams).
- **Discrete** (a.k.a. integer variables): represent counts and usually can't be divided into units smaller than one (e.g. 1 tree).

**Categorical variables** represent groupings of things (e.g. the different tree species in a forest). Types of categorical variables include:

- **Ordinal**: represent data with an order (e.g. rankings).
- **Nominal**: represent group names (e.g. brands or species names).
- **Binary**: represent data with a yes/no or 1/0 outcome (e.g. win or lose).

Choose the test that fits the types of predictor and outcome variables you have collected (if you are doing an experiment, these are the independent and dependent variables). Consult the tables below to see which test best matches your variables.

## **Choosing a parametric test: regression, comparison, or correlation**

Parametric tests usually have stricter requirements than nonparametric tests, and are able to make stronger inferences from the data.
They can only be conducted with data that adheres to the common assumptions of statistical tests.

The most common types of parametric test include regression tests, comparison tests, and correlation tests.

### **Regression tests**

Regression tests look for **cause-and-effect relationships**. They can be used to estimate the effect of one or more continuous variables on another variable.

### **Comparison tests**

Comparison tests look for **differences among group means**. They can be used to test the effect of a categorical variable on the mean value of some other characteristic.

T-tests are used when comparing the means of precisely two groups (e.g., the average heights of men and women). ANOVA and MANOVA tests are used when comparing the means of more than two groups (e.g., the average heights of children, teenagers, and adults).

+--------------------+-------------------------+-----------------------------------------+------------------------------------------------------------+
|                    | Predictor Variable      | Outcome Variable                        | Example                                                    |
+====================+=========================+=========================================+============================================================+
| Paired t-test      | Categorical             | Quantitative                            | What is the effect of two different test prep programs on |
|                    |                         |                                         | the average exam scores for students from the same class? |
|                    | 1 predictor             | Groups come from the same population    | (Predictor: test program)                                  |
+--------------------+-------------------------+-----------------------------------------+------------------------------------------------------------+
| Independent t-test | Categorical             | Quantitative                            | What is the difference in average exam scores for students |
|                    |                         |                                         | from two different schools? (Predictor: school A/B)        |
|                    | 1 predictor             | Groups come from different populations  |                                                            |
+--------------------+-------------------------+-----------------------------------------+------------------------------------------------------------+
| ANOVA              | Categorical             | Quantitative                            | What is the difference in average pain levels among        |
|                    |                         |                                         | post-surgical patients given three different treatments?   |
|                    | 1 or more predictors    | 1 outcome                               |                                                            |
+--------------------+-------------------------+-----------------------------------------+------------------------------------------------------------+
| MANOVA             | Categorical             | Quantitative                            | What is the effect of flower species on petal length,      |
|                    |                         |                                         | petal width and stem length?                               |
|                    | 1 or more predictors    | 2 or more outcomes                      |                                                            |
+--------------------+-------------------------+-----------------------------------------+------------------------------------------------------------+
| Pearson's r        | 2 continuous variables  | 2 continuous variables                  | How are latitude and temperature related?                  |
+--------------------+-------------------------+-----------------------------------------+------------------------------------------------------------+

## Flowchart for parametric tests

```{r, echo=FALSE}
library(DiagrammeR)
f <- DiagrammeR::grViz("digraph flowchart {

  graph [layout = dot, rankdir = TB]  // TB stands for Top to Bottom

  node [shape = rectangle, style = filled, color = lightblue]

  A [label = 'Predictor variable: categorical or quantitative?', color = lightblue, fontcolor = black, fontsize = 12]
  B [label = 'Categorical', color = lightgray, fontcolor = black, fontsize = 12]
  C [label = 'Quantitative', color = lightgray, fontcolor = black, fontsize = 12]
  D [label = 'Outcome variable: categorical or quantitative?', color = lightblue, fontcolor = black, fontsize = 12]
  E [label = 'Categorical', color = lightgray, fontcolor = black, fontsize = 12]
  F [label = 'Quantitative', color = lightgray, fontcolor = black, fontsize = 12]
  G [label = 'Choose a non-parametric test', color = lightpink, fontcolor = black, fontsize = 12]
  H [label = 'Do a comparison of means test', color = lightpink, fontcolor = black, fontsize = 12]
  I [label = 'How many groups are being compared?', color = lightblue, fontcolor = black, fontsize = 12]
  J [label = 'Two', color = lightgray, fontcolor = black, fontsize = 12]
  K [label = 'More than two', color = lightgray, fontcolor = black, fontsize = 12]
  L [label = 'How many outcome variables?', color = lightblue, fontcolor = black, fontsize = 12]
  M [label = 'One', color = lightgray, fontcolor = black, fontsize = 12]
  N [label = 'More than one', color = lightgray, fontcolor = black, fontsize = 12]
  O [label = 'T-test', color = lightpink, fontcolor = black, fontsize = 12]
  P [label = 'ANOVA', color = lightpink, fontcolor = black, fontsize = 12]
  Q [label = 'MANOVA', color = lightpink, fontcolor = black, fontsize = 12]
  R [label = 'Logistic regression', color = lightpink, fontcolor = black, fontsize = 12]
  S [label = 'How many predictor variables?', color = lightblue, fontcolor = black, fontsize = 12]
  T [label = 'One', color = lightgray, fontcolor = black, fontsize = 12]
  U [label = 'More than one', color = lightgray, fontcolor = black, fontsize = 12]
  V [label = 'Simple regression', color = lightpink, fontcolor = black, fontsize = 12]
  W [label = 'Multiple regression', color = lightpink, fontcolor = black, fontsize = 12]

  A -> B
  A -> C
  B -> D
  C -> S
  D -> E
  D -> F
  E -> G
  F -> H
  H -> I
  I -> J
  I -> K
  J -> O
  K -> L
  L -> M
  L -> N
  M -> P
  N -> Q
  C -> R
  S -> T
  S -> U
  T -> V
  U -> W
}")
f
```

::: {.callout-orange appearance="simple" icon="false"}
### Parametric Tests

Parametric tests are statistical tests that make certain assumptions about the parameters of the population distribution from which the sample is drawn. These assumptions typically include:

- **Normality**: The data follows a normal distribution.
- **Homogeneity of variance**: The variances within each group being compared are similar.
- **Independence**: The observations are independent of each other.

Because of these assumptions, parametric tests are generally more powerful and can make stronger inferences about the population. Some common parametric tests include:

- **T-tests**: Used to compare the means of two groups.
- **ANOVA (Analysis of Variance)**: Used to compare the means of three or more groups.
- **Regression Analysis**: Used to examine the relationship between one or more predictor variables and an outcome variable.

### Non-Parametric Tests

Non-parametric tests, on the other hand, do not make strict assumptions about the population parameters. They are more flexible and can be used when the assumptions of parametric tests are not met, such as when the data does not follow a normal distribution or when sample sizes are small.
Non-parametric tests are also useful for ordinal data or data with outliers.

Some common non-parametric tests include:

- **Mann-Whitney U Test**: Used to compare the medians of two independent groups.
- **Wilcoxon Signed-Rank Test**: Used to compare the medians of two related groups.
- **Kruskal-Wallis Test**: Used to compare the medians of three or more independent groups.
- **Spearman's Rank Correlation**: Used to assess the relationship between two ordinal variables.

### Key Differences

- **Assumptions**: Parametric tests assume specific population parameters, while non-parametric tests do not.
- **Data Type**: Parametric tests are typically used for interval or ratio data, whereas non-parametric tests can be used for ordinal data or data that does not meet parametric assumptions.
- **Power**: Parametric tests are generally more powerful when their assumptions are met, meaning they are more likely to detect a true effect. Non-parametric tests are more robust and flexible but may be less powerful.

### Example Scenario

Imagine you want to compare the test scores of students from two different schools. If the test scores are normally distributed and the variances are similar, you would use a parametric test like the independent t-test. However, if the test scores are not normally distributed or have outliers, you might use a non-parametric test like the Mann-Whitney U Test.
:::

## Choosing a non-parametric test

+---------------------------------+--------------------+----------------------------------------+---------------------+
|                                 | Predictor variable | Outcome variable                       | Use in place of...  |
+=================================+:==================:+:======================================:+:===================:+
| Spearman's *r*                  | Quantitative       | Quantitative                           | Pearson's *r*       |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Chi square test of independence | Categorical        | Categorical                            | Pearson's *r*       |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Sign test                       | Categorical        | Quantitative                           | One-sample *t*-test |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Kruskal--Wallis *H*             | Categorical        | Quantitative                           | ANOVA               |
|                                 |                    |                                        |                     |
|                                 | 3 or more groups   |                                        |                     |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| ANOSIM                          | Categorical        | Quantitative                           | MANOVA              |
|                                 |                    |                                        |                     |
|                                 | 3 or more groups   | 2 or more outcome variables            |                     |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Wilcoxon Rank-Sum test          | Categorical        | Quantitative                           | Independent t-test  |
|                                 |                    |                                        |                     |
|                                 | 2 groups           | groups come from different populations |                     |
+---------------------------------+--------------------+----------------------------------------+---------------------+
| Wilcoxon Signed-rank test       | Categorical        | Quantitative                           | Paired t-test       |
|                                 |                    |                                        |                     |
|                                 | 2 groups           | groups come from the same population   |                     |
+---------------------------------+--------------------+----------------------------------------+---------------------+
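To make the parametric/non-parametric swap concrete, here is a minimal sketch running the same comparison with an independent t-test and its rank-based replacement. The simulated right-skewed data (exponential draws) and the group sizes are assumptions chosen purely for illustration:

```{r}
# Same comparison with a parametric test and its non-parametric counterpart
set.seed(42)
group_a <- rexp(20, rate = 1)    # right-skewed, so normality is doubtful
group_b <- rexp(20, rate = 0.5)  # same shape, larger mean

t.test(group_a, group_b)         # independent t-test (parametric)
wilcox.test(group_a, group_b)    # Wilcoxon rank-sum test (non-parametric)
```

When the normality assumption is doubtful, as with skewed data like this, the rank-sum result is the safer of the two to report.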